
Very impressive looking! Just wanted to caution that it's worth being a bit skeptical without benchmarks, as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds the model up at a cost to output quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.


As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.

I hope you are enjoying your time of having an empty calendar :)


Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.


There is an API and there is a waitlist. Sign up at http://wow.groq.com/


As tome mentioned, we don't quantize; all activations are FP16.

And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b


Jesus Christ, these speeds with FP16? That is simply insane.


Ask how much hardware is behind it.


All that matters is the cost. Their price is cheap, so the real question is whether they are subsidizing the cost to achieve that price or not.


> All that matters is the cost.

Not really; sustainability matters too. If they are the only game in town, you want to know that game isn't going to end suddenly when their runway turns into a brick wall.


Cost, not price.


The point of asking how much hardware is behind it is to estimate the cost (both capital and operational, i.e. power).


At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025


The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
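For anyone wondering what "FP8 at rest, FP16 compute" looks like in practice, here's a minimal weight-only-quantization sketch (an illustration of the general technique, not Groq's implementation; assumes a recent PyTorch with float8 dtypes and FP16 matmul support):

    import torch

    # Weights are kept in memory as FP8, halving their footprint at rest.
    # (Real implementations usually also apply a per-tensor or per-channel
    # scale factor before the cast; omitted here for brevity.)
    w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
    w_fp8 = w_fp16.to(torch.float8_e4m3fn)

    # Activations stay FP16 and the matmul runs in FP16: the FP8 weights
    # are upcast on the fly right before the compute.
    x = torch.randn(1, 4096, dtype=torch.float16)
    y = x @ w_fp8.to(torch.float16).T

The quality question then comes down to how much information is lost in that FP16 -> FP8 cast of the weights.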


Can you explain whether this affects quality relative to FP16? And is Mixtral quantized?


We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.


Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)


I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try it for yourself: I run Mixtral locally at 5.8bpw, and there is an OBVIOUS difference between what I have seen from Groq and my local setup, quite apart from Groq's sound-barrier-shattering speed. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.


Yes, but this gray-area underperformance lets them claim they are the cheapest and fastest, which appeals to people for whom qualitative (i.e. real) performance doesn't matter.


What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.


Nothing really wrong with FP8 IMO; it usually performs within ~98% of FP16 quality while significantly reducing memory usage.
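To make the memory saving concrete, a rough back-of-envelope for a 70B-parameter model (my numbers, not Groq's):

    params = 70e9                   # Llama 2 70B parameter count
    print(params * 2 / 1e9, "GB")   # FP16: 2 bytes/param -> ~140 GB of weights
    print(params * 1 / 1e9, "GB")   # FP8:  1 byte/param  -> ~70 GB of weights

Halving the weight footprint also roughly halves the memory bandwidth needed per token, which is usually the bottleneck for inference speed.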


As part of our benchmarking of Groq, we asked them about quantization and they assured us they are running models at full FP16. It's a good point and important to check.

Link to benchmarking: https://artificialanalysis.ai/ (Note question was regarding API rather than their chat demo)


Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.

In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?


The thing is that tokens aren't an apples-to-apples metric... Stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially...


tangent: Great to see you again on HN!



