
Very impressive looking! Just wanted to caution that it's worth being a bit skeptical without benchmarks, as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds the model up at a cost to output quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.


As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.

I hope you are enjoying your time of having an empty calendar :)


Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.


There is an API and there is a waitlist. Sign up at http://wow.groq.com/


As tome mentioned, we don't quantize; all activations are FP16.

And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b


Jesus Christ, these speeds with FP16? That is simply insane.


Ask how much hardware is behind it.


All that matters is the cost. Their price is cheap, so the real question is whether they are subsidizing the cost to achieve that price or not.


> All that matters is the cost.

Not really; sustainability matters too. If they are the only game in town, you want to know that game isn't going to end suddenly when their runway turns into a brick wall.


Cost, not price.


The point of asking how much hardware is behind it is to estimate the cost (both capital and operational, i.e. power).


At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025


The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
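For anyone wondering what "FP8 at rest, FP16 compute" looks like in practice, here's a minimal weight-only-quantization sketch (an illustration of the general technique, not Groq's implementation; assumes a recent PyTorch with float8 dtypes and FP16 matmul support):

    import torch

    # Weights are kept in memory as FP8, halving their footprint at rest.
    # (Real implementations usually also apply a per-tensor or per-channel
    # scale factor before the cast; omitted here for brevity.)
    w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
    w_fp8 = w_fp16.to(torch.float8_e4m3fn)

    # Activations stay FP16 and the matmul runs in FP16: the FP8 weights
    # are upcast on the fly right before the compute.
    x = torch.randn(1, 4096, dtype=torch.float16)
    y = x @ w_fp8.to(torch.float16).T

The quality question then comes down to how much information is lost in that FP16 -> FP8 cast of the weights.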


Can you explain whether this affects quality relative to FP16? And is Mixtral quantized?


We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.


Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)


I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try it for yourself: I run Mixtral locally at 5.8bpw, and there is an OBVIOUS difference between what I have seen from Groq and my local setup, quite apart from Groq's sound-barrier-shattering speed. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.


Yes, but this gray-area underperformance lets them claim they are the cheapest and fastest, which appeals to people for whom qualitative (i.e. real) performance doesn't matter.


What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.


Nothing really wrong with FP8 IMO; it usually performs within ~98% of FP16 quality while significantly reducing memory usage.
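To make the memory saving concrete, a rough back-of-envelope for a 70B-parameter model (my numbers, not Groq's):

    params = 70e9                   # Llama 2 70B parameter count
    print(params * 2 / 1e9, "GB")   # FP16: 2 bytes/param -> ~140 GB of weights
    print(params * 1 / 1e9, "GB")   # FP8:  1 byte/param  -> ~70 GB of weights

Halving the weight footprint also roughly halves the memory bandwidth needed per token, which is usually the bottleneck for inference speed.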


As part of our benchmarking of Groq, we asked them about quantization and they assured us they are running models at full FP16. It's a good point and important to check.

Link to benchmarking: https://artificialanalysis.ai/ (Note question was regarding API rather than their chat demo)


Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.

In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?


The thing is that tokens aren't an apples-to-apples metric... Stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially...


tangent: Great to see you again on HN!



