o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
I really loved Phind and always think of it as the OG perplexity / RAG search engine.
Sadly stopped my subscription, when you removed the ability to weight my own domains...
Otherwise the fine-tune for your output format for technical questions is great, with the options, the pro/contra and the mermaid diagrams. Just way better for technical searches, than what all the generic services can provide.
Same here, 2.5 Pro is very good at coding. But it’s also cocky and blames everything but itself for something not working. Eg “the linter must be wrong you should reinstall it”, “looks to be a problem with the Go compiler”, “this function HAS to exist, that’s weird that we’re getting an error”
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, sonnet 3.7 is much more focussed and in particular works much better when finding the places that needs change and using other tooling. I guess the integration in cursor is just much better and more mature.
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
My experience is it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised as if you told me to layout a diagram in XML/YAML there would be obvious mistakes and layout issues.
I'm not really certain a text output model can ever do well here.
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
Had the same experience. o3-mini failing misreably, claude 3.7 as well, but gemini 2.5 pro solved it perfectly. (image of diagram without source to tikz diagram)
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
You probably know this and are looking for consistency but, a little trick I use
is to feed the original data of what I need as a diagram and to re-imagine, it as an image “ready for print” - not native, but still a time saver and just studying with unstructured data or handles this surprisingly well. Again not native…naive, yes. Native, not yet. Be sure to double check triple check as always. give it the ol’ OCD treatment.
re: "grok-3 is r1 with mods" -- do you mean you believe they distilled deepseek r1? that was my assumption as well, though i thought it more jokingly at first it would make a lot of sense. i actually enjoy grok 3 quite a lot, it has some of the most entertaining thinking traces.
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.