Looks like Groq (at 1k+ tokens/second) and Fireworks are already live on OpenRouter: https://openrouter.ai/openai/gpt-oss-120b

$0.15/M in / $0.60-0.75/M out

edit: Now Cerebras too, at 3,815 tps for $0.25/M in / $0.69/M out.



Wow, this was actually blazing fast. I prompted "how can the 45th and 47th presidents of America share the same parents?"

On ChatGPT.com, o3 thought for 13 seconds; on OpenRouter, GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.


I'm not sure that's a good question for drawing positive conclusions from the "thought for 0.7 seconds" figure - the answer is so simple that ChatGPT 4o (with no thinking time) immediately answered it correctly. The only surprising thing in your test is that o3 wasted 13 seconds thinking about it.


A current major outstanding problem with thinking models is how to get them to think an appropriate amount.


The providers disagree. You pay per token. Verbose models are the most profitable. Have fun!


For API users, yes, but for the average person with a subscription or using the free tier it’s the inverse.


Nowadays a pretty large % of usage must be going through monthly subscriptions.


Interesting choice of prompt. None of the local models I have in Ollama (consumer mid-range GPU) were able to get it right.


When I pay attention to o3's CoT, I notice it spends a few passes thinking about my system prompt. It's hard to imagine this question is hard enough to spend 13 seconds on.


Not gonna lie, I got sorta goosebumps.

I'm not kidding - such progress from a technological point of view is just fascinating!


How many people are discussing this after one person did 1 prompt with 1 data point for each model and wrote a comment?

What is being measured here? For end-to-end time, a simple model is:

t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
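
To make the decomposition concrete, here's a minimal sketch (all numbers invented purely for illustration) of why a single wall-clock measurement can't tell you which term differs between two providers:

    # A client only ever observes t_total; the split below is hypothetical.
    components = {
        "t_network": 0.08,           # round trip to the provider (s)
        "t_queue": 0.05,             # waiting for a free worker (s)
        "t_batch_wait": 0.10,        # waiting for a batch slot (s)
        "t_inference": 0.45,         # actual forward passes (s)
        "t_service_overhead": 0.02,  # auth, routing, logging (s)
    }

    t_total = sum(components.values())
    print(f"t_total = {t_total:.2f}s")  # the only number the client sees
    for name, t in components.items():
        print(f"  {name}: {t:.2f}s ({t / t_total:.0%})")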


I apologize for linking to Twitter, but I can't post a video here, so:

https://x.com/tekacs/status/1952788922666205615

Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...

I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.


Cleanshot link for those who don't want to go to X: https://share.cleanshot.com/bkHqvXvT


A few days ago I posted a slowed-down version of the video demo on someone's repo because it was unreadably fast due to being sped up.

https://news.ycombinator.com/item?id=44738004

... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.


Non-rhetorically, why would someone pay for the o3 API now that I can get this open model from OpenAI served for cheaper? Interesting dynamic... will they drop o3 pricing next week (currently 10-20x the cost[1])?

[1] currently $3/M in / $8/M out: https://platform.openai.com/docs/pricing
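
For what it's worth, the multiple checks out against the prices quoted upthread (a quick sketch, using the low-end OpenRouter figures):

    # USD per 1M tokens, as quoted in this thread
    o3_in, o3_out = 3.00, 8.00      # o3 API [1]
    oss_in, oss_out = 0.15, 0.60    # gpt-oss-120b on OpenRouter (low end)

    print(f"input:  {o3_in / oss_in:.0f}x")    # 20x
    print(f"output: {o3_out / oss_out:.0f}x")  # ~13x (~11x at $0.75/M out)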


Not even that - even if (let's say) o3 being marginally better is important for your task, why would anyone use o4-mini? It's almost 10x the price for the same performance (maybe even worse): https://openrouter.ai/openai/o4-mini


Probably because they are going to announce GPT-5 imminently.


Wow, that's significantly cheaper than o4-mini ($1.10/M input tokens, $4.40/M output tokens), which seems to be on par with gpt-oss-120b. Almost 10x the price.

LLMs are getting cheaper much faster than I anticipated. I'm curious whether it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded devices soon.


It's funny, because I was thinking the opposite: the pricing seems way too high for a model with only ~5B active parameters.


Sure, you're right, but if I can squeeze o4-mini-level utility out of it at less than a quarter of the price, does it really matter?


Yes


Are the prices staying aligned to the fundamentals (hardware, energy), or is this a VC-funded land grab pushing prices to the bottom?


It is interesting that OpenAI isn't offering any inference for these models.


Makes sense to me. Inference on these models will be a race to the bottom. Hosting inference themselves will be a waste of compute / dollar for them.


I really want to try coding with this at 2,600 tokens/s (from Cerebras). Imagine generating thousands of lines of code as fast as you can prompt. If it doesn't work, who cares - generate another thousand and try again! And at $0.69/M tokens it would only cost about $6.50 an hour.
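
The per-hour figure is easy to verify, assuming the stream stays saturated for the full hour:

    tps = 2600             # Cerebras throughput, tokens/second
    usd_per_mtok = 0.69    # output price, USD per 1M tokens
    cost = tps * 3600 / 1e6 * usd_per_mtok
    print(f"${cost:.2f}/hour")  # ~$6.46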


I tried this (gpt-oss-120b with Cerebras) with Roo Code. It repeatedly failed to use the tools correctly, and then I got 429 Too Many Requests. So much for the "as fast as I can think" idea!

I'll have to try again later but it was a bit underwhelming.

The latency also seemed pretty high, not sure why. I think with the latency, the throughput ends up not making much difference.

Btw Groq has the 20b model at 4000 TPS but I haven't tried that one.
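
For anyone hitting the same 429s, a minimal retry-with-backoff wrapper usually helps. This is only a sketch: the endpoint URL and model id are assumptions to swap for your provider's actual values.

    import time
    import requests

    URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

    def chat(messages, max_retries=5):
        """POST a chat request, backing off exponentially on 429s."""
        delay = 1.0
        for _ in range(max_retries):
            resp = requests.post(URL, headers=HEADERS, json={
                "model": "gpt-oss-120b",  # assumed model id
                "messages": messages,
            })
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.json()
            # Honor Retry-After if the server sends it, else back off.
            time.sleep(float(resp.headers.get("Retry-After", delay)))
            delay *= 2
        raise RuntimeError("still rate-limited after all retries")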



