Very impressive looking! Just wanted to caution it's worth being a bit skeptical without benchmarks as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost of model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.
I hope you are enjoying your time of having an empty calendar :)
Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.
Not really; sustainability matters. If they are the only game in town, you want to know the game isn't going to end suddenly when their runway turns into a brick wall.
Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)
I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try it for yourself: I can run Mixtral at 5.8bpw, and there is an OBVIOUS difference between what I have seen from Groq and my local setup, besides the sound-barrier-shattering speed of Groq. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.
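For intuition about where that loss comes from, here is a minimal sketch of round-trip quantization error in plain NumPy. This is a toy: a random stand-in weight tensor and naive symmetric per-tensor quantization, not how any provider (Groq or otherwise) actually quantizes or serves models. It just shows that reconstruction error grows as you drop bits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for one weight tensor of a model.
w = rng.standard_normal(4096).astype(np.float32)

def quantize_roundtrip(w, bits):
    """Naive symmetric per-tensor quantization: scale to signed
    integers of the given bit width, round, then map back."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

for bits in (8, 6, 4):
    err = np.sqrt(np.mean((w - quantize_roundtrip(w, bits)) ** 2))
    print(f"{bits}-bit RMS reconstruction error: {err:.4f}")
```

Each bit you drop roughly doubles the rounding error on this toy tensor; whether that translates into visible output degradation for a real model is exactly what benchmarks (and side-by-side comparisons like the one above) are arguing about.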
Yes, but this gray-area underperformance, which lets them claim to be the cheapest and fastest, appeals to people for whom qualitative (aka real) performance doesn't matter.
As part of our benchmarking of Groq we asked them about quantization, and they assured us they are running the models at full FP16. It's a good point and important to check.
Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.
In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?
The thing is that tokens aren't an apples-to-apples metric: stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially.