
The main problem with the Groq LPUs is that they don't have any HBM on them at all, just a minuscule amount (230 MiB) [0] of ultra-fast SRAM (20x faster than HBM3, just to be clear). That means you need ~256 LPUs (4 full server racks of compute; each unit in a rack contains 8 LPUs and there are 8 of those units in a single rack) just to serve a single model [1], whereas you can get a single H200 (1/256 of the server rack density) and serve these models reasonably well.
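As a rough check of the arithmetic above, here is a minimal sketch using only the figures quoted in this comment (230 MiB per LPU, 8 LPUs per unit, 8 units per rack). The helper names are made up for illustration, and the sketch assumes the weights live entirely in on-chip SRAM and ignores KV cache and activations.

```python
import math

SRAM_PER_LPU_MIB = 230   # per-chip SRAM, from the figure cited in [0]
LPUS_PER_UNIT = 8        # as described in the comment
UNITS_PER_RACK = 8       # as described in the comment


def lpus_needed(model_gib: float) -> int:
    """Chips needed to hold model_gib of weights purely in on-chip SRAM."""
    return math.ceil(model_gib * 1024 / SRAM_PER_LPU_MIB)


def racks_needed(lpus: int) -> float:
    """Racks occupied by a given number of LPUs."""
    return lpus / (LPUS_PER_UNIT * UNITS_PER_RACK)


# Total SRAM available across the ~256 LPUs mentioned above, in GiB.
total_sram_gib = 256 * SRAM_PER_LPU_MIB / 1024

print(total_sram_gib)               # aggregate SRAM in GiB
print(racks_needed(256))            # racks for 256 LPUs
print(lpus_needed(total_sram_gib))  # round trip back to the chip count
```

So 256 LPUs give about 57.5 GiB of SRAM in total, spread over 4 racks of 64 chips each.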

It might work well if you have a single model with lots of customers, but as soon as you need more than a single model and a lot of finetunes/high-rank LoRAs etc., these won't be usable. The same goes for any on-prem deployment, since the main advantage is consolidating people onto the same model.

[0]: https://wow.groq.com/groqcard-accelerator/

[1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334



Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient?

You also say that H200's work reasonably well, and that's reasonable (but debatable) for synchronous, human interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.


Just curious, how does this work out in terms of TCO (even assuming the price of a Groq LPU is 0$)? What you say makes sense, but I'm wondering how you strike a balance between massive horizontal scaling vs vertical scaling. Sometimes (quite often in my experience) having a few beefy servers is much simpler/cheaper/faster than scaling horizontally across many small nodes.

Or I got this completely wrong, and your solution enables use-cases that are simply unattainable on mainstream (Nvidia/AMD) hardware, making TCO argument less relevant?


We're providing by far the lowest latency LLM engine on the planet. You can't reduce latency by scaling horizontally.


Distributed, shared memory machines used to do exactly that in HPC space. They were a NUMA alternative. It works if the processing plus high-speed interconnect are collectively faster than the request rate. The 8x setups with NVLink are kind of like that model.

You may have meant that nobody has a stack that uses clustering or DSM with low-latency interconnects. If so, then that might be worth developing given prior results in other low-latency domains.


> Distributed, shared memory machines used to do exactly that in HPC space.

reformed HPC person here.

Yes, but not latency optimised in the case here. HPC is normally designed for throughput. Accessing memory from outside your $locality is normally horrifically expensive, so only done when you can't avoid it.

For most serving cases, you'd be much happier having a bunch of servers with a number of groqs in them, than managing a massive HPC cluster and trying to keep it both up and secure. The connection access model is much more traditional.

Shared memory clusters are not really compatible with secure end-user access. It is possible to partition memory access, but it's something that's not off the shelf (well, that might have changed recently). Also, shared memory means shared fuckups.

I do get what you're hinting at, but if you want to serve low latency, high compute "messages" then discrete "APU" cards are a really good way to do it simply (assuming you can afford it). HPCs are fun, but it's not fun trying to keep them up with public traffic on them.


It would probably be a cluster of thin nodes with GPU’s or low-cost accelerators over a low-latency interconnect. The DSM would be layered on top of that. The AI cluster would handle processing with security, etc done more by other components. They’re usually layered.

I agree it’s harder to manage with less, fine-grained security. People were posting Groq chips at $20k each, though. With that, we’re talking whether the management of it is worth it for installations costing six or more digits. That might be more justifiable if an alternative saves them a good chunk of six or more digits.

Their main advantage is a solution that’s ready to go :)


I think existing players will have trouble developing a low latency solution like us whilst they are still running on non-deterministic hardware.


While you’re here, I have a quick, off-topic question. We’ve seen incredible results with GPT-3 175B (Davinci) and GPT-4 (MoE). Making attempts at open models that reuse their architectural strategies could have a high impact on everyone. Those models took 2500-25000 GPU’s to train, though. It would be great to have a low-cost option for pre-training Davinci-class models.

It would be great if a company or others with AI hardware were willing to do production runs of chips sold at cost specifically to make open, permissive-licensed models. As in, since you’d lose profit, the cluster owner and users would be legally required to only make permissive models. Maybe at least one in each category (eg text, visual).

Do you think your company or any other hardware supplier would do that? Or someone sell 2500 GPU’s at cost for open models?

(Note to anyone involved in CHIPS Act: please fund a cluster or accelerator specifically for this.)


Great idea, but Groq doesn't have a product suitable for training at the moment. Our LPUs shine in inference.


What do you mean by non-deterministic hardware? cuBLAS on a laptop GPU was deterministic when I tried it last iirc


Tip of the ice-berg.

DRAM needs to be refreshed every X cycles.

This means you don't know the time it takes to read from memory. You could be reading at a refresh cycle. This circuitry also adds latency.


OP says SRAM, which doesn't decay so no refreshing.


Timing can simply mean the FETs that make up the logic circuits of a chip. The transition from high to low and low to high has a minimum safe time to register properly...


Non-deterministic timing characteristics.


> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.

For these use cases, time to first byte is the most important metric, not total throughput.


It’s important…if you’re building a chatbot.

The most interesting applications of LLMs are not chatbots.


> The most interesting applications of LLMs are not chatbots.

What are they then? Every use case I’ve seen is either a chatbot or like a copy editor which is just a long form chatbot.


Obviously not op, but these days LLMs can be fuzzy functions with reliably structured output, and are multi-modal.

Think about the implications of that. I bet you can come up with some pretty cool use cases that don't involve you talking to something over chat.

One example:

I think we'll be seeing a lot of "general detectors" soon. Without training or predefined categories, get pinged when (whatever you specify) happens. Whether it's a security camera, web search, event data, etc


Complex data tagging/enrichment tasks.


> The most interesting applications of LLMs are not chatbots.

In your opinion, what are the most interesting?


> Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia

I built one, should be live soon ;-)


Exciting! Looking forward to seeing it.


I have one, with 13B, on a 5-year-old 48GB Q8000 GPU. It can also see; it’s LLaVA. And it is very important that it is local, as privacy is important and streaming images to the cloud is time consuming.

You only need a few tokens, not the full 500-token response, to run TTS. And you can pre-generate responses online while ASR is still in progress. With a bit of clever engineering the response starts with virtually no delay, the moment it's natural to start the response.


Did you find anything cheaper for local installation?


>Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

Is your version of that on a different page from this chat bot?


You can’t scale horizontally forever because of communication. I think HBM would provide a lot more flexibility with the number of chips you need.


Are there voice responses in the demo? I couldn't find em?


Here's a live demo on CNN of Groq plugged into a voice API

https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s


Thanks, that's pretty impressive. I suppose with blazing fast token generation now things like diarisation and the actual model are holding us back.

Once it flawlessly understands when it is being spoken to/if it should speak based on the topic at hand (like we do) then it'll be amazing.

I wonder if ML models can feel that feeling of wanting to say something so bad but having to wait for someone else to stop talking first ha ha.


Wow! Absolutely astounding!


Hi Matanyal, we worked with groq for a project last year. would you be open to connect on LinkedIn? :)


Groq states in this article [0] that they used 576 chips to achieve these results, and continuing with your analysis, you also need to factor in that for each additional user you want to have requires a separate KV cache, which can add multiple more gigabytes per user.
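To put a number on the per-user KV cache point, here is a back-of-the-envelope sketch. The model shape below (80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 values, 4096-token context) is an assumption chosen to resemble a Llama-2-70B-like model; it is not taken from the article, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    The leading factor of 2 accounts for storing one key and one value
    vector per layer, per KV head, per token position.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2**30
SRAM_PER_LPU_BYTES = 230 * 2**20  # 230 MiB per chip, as quoted upthread

# Assumed Llama-2-70B-like shape with grouped-query attention (8 KV heads).
with_gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

# Same shape without grouped-query attention (one KV head per attention head).
without_gqa = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)

print(with_gqa / GIB)                 # GiB per user with GQA
print(without_gqa / GIB)              # GiB per user without GQA
print(with_gqa / SRAM_PER_LPU_BYTES)  # chips' worth of SRAM per user
```

Under these assumptions one full-context user costs 1.25 GiB with grouped-query attention and 10 GiB without it, which is several chips' worth of 230 MiB SRAM either way.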

My professional independent observer opinion (not based on my 2 years of working at Groq) would have me assume that their COGS to achieve these performance numbers would exceed several million dollars, so depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from an actual performance per dollar standpoint they don’t seem viable, but do have a very cool demo of an insane level of performance if you throw cost concerns out the window.

[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...


Thomas, I think for full disclosure you should also state that you left Groq to start a competitor (a competitor which doesn't have the world's lowest latency LLM engine nor a guarantee to match the cheapest per token prices, like Groq does).

Anyone with a serious interest in the total cost of ownership of Groq's system is welcome to email contact@groq.com.


I thought that was clear through my profile, but yes, Positron AI is focused on providing the best performance per dollar while providing the best quality of service and capabilities rather than just focusing on a single metric of speed.

A guarantee to match the cheapest per token prices is sure a great way to lose a race to the bottom, but I do wish Groq (and everyone else trying to compete against NVIDIA) the greatest luck and success. I really do think that the great single batch/user performance by Groq is a great demo, but is not the best solution for a wide variety of applications, but I hope it can find its niche.


I think that just means it’s for people that really want it?

John Doe and his friends will never have a need to have their fart jokes generated at this speed, and are more interested in low costs.

But we’d recently been doing call center operations and being able to quickly figure out what someone said was a major issue. You kind of don’t want your system to wait for a second before responding each time. I can imagine it making sense if it reduces the latency to 10ms there as well. Though you might still run up against the ‘good enough’ factor.

I guess few people want to spend millions to go from 1000ms to 10ms, but when they do they really want it.


What happened to Rex? Did it hit production or get abandoned?

It was also on my list of things to consider modifying for an AI accelerator. :)


Long story, but technically REX is still around but has not been able to continue to develop due to lack of funding and my cofounder and I needing to pay bills. We produced initial test silicon, but due to us having very little money after silicon bringup, most of our conversations turned to acquihire discussions.

There should be a podcast release (https://microarch.club/) in the near future that covers REX's history and a lot of lessons learned.


If you want low latency you have to be really careful with HBM, not only because of the delay involved, but also the non-determinacy. One of the huge benefits of our LPU architecture is that we can build systems of hundreds of chips with fast interconnect and we know the precise timing of the whole system to within a few parts per million. Once you start integrating non-deterministic components your latency guarantees disappear very quickly.


I don't know about HBM specifically, but DDR and GDDR at a protocol level are both deterministic. It's the memory controller doing a bunch of reordering that makes them non-deterministic. Presumably, if that is the reason you don't like DRAM, you could build your compiler to be memory-layout aware and have the memory controller issue commands without reordering.


That could be possible. It's out of my area of expertise so I can't say for sure. My understanding was HBM forces on you specific access patterns and non-deterministic delays. Our compiler already deals with many other forms of resource-aware scheduling so it could take into account DRAM refreshes easily, so I feel like there must be something else that makes SRAM more suitable in our case. I'll have to leave that to someone more knowledgeable to explain though ...


Presumably with dram you also have to worry about refreshes, which can come along at arbitrary times relative to the workload.


You can control when those happen, too.


not without affecting performance though? If you delay refreshes, this lowers performance as far as I remember...


Control of all of this can come at a performance cost, but in the case of DRAM refreshes, it doesn't lower performance if you don't do them, it loses data. Nominally, you could do your refreshes closer together and as long as you know that the rows being refreshed will be idle and you have spare time on the bus, you're ok.


From a theoretical perspective, this is absolutely not true. Asynchronous logic can achieve much lower latency guarantees than synchronous logic.

Come to think of it, this is one of the few places where asynchronous logic might be more than academic... Async logic is hard with complex control flows, which deep learning inference does not have.

(From a practical perspective, I know you were comparing to independently-clocked logic, rather than async logic)


(Groq Employee) You're right - we are comparing to independently-clocked logic.

I wonder whether async logic would be feasible for reconfigurable "Spatial Processor" type architectures [1]. As far as LPU architectures go, they fall in the "Matrix of Processing Engines"[1] family of architectures, which I would naively guess is not the best suited to leverage async logic.

1: I'm using the "Spatial Processor" (7:14) and "Matrix of Processing Engines" (8:57) terms as defined in https://www.youtube.com/watch?v=LUPWZ-LC0XE. Sorry for a video link, I just can't think of another single reference that explains the two approaches.


Curiously, almost all of this video is covered by the computer architecture literature of the late 90's and early 00's. At the time, I recall Tom Knight had done most of the analysis in this video, but I don't know if he ever published it. It was extrapolating into the distant future.

To answer your questions:

- Spatial processors are an insanely good fit for async logic

- Matrix of processing engines are a moderately good fit -- definitely could be done, but I have no clue if it'd be a good idea.

In SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, and 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a compute will take as much time as it needs, and no more.

Footnote: Personally, I think there are a lot of good ideas in 80's era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we can have *many* more of them than a modern multiply unit. Critically, they could be positioned closer to the data, so there would be less data moving around. Especially the early pipelined / scalar / RISC cores seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz. It has the same number of transistors as 32,000 SA-110 processors (running at 200MHz on a 350 nanometer process in 1996).

TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.


This was sort of the dream of KNL but today I noticed

    Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are marked as deprecated. GCC will emit a warning when using the -mavx5124fmaps, -mavx5124vnniw, -mavx512er, -mavx512pf, -mprefetchwt1, -march=knl, -march=knm, -mtune=knl or -mtune=knm compiler switches. Support will be removed in GCC 15.

the issue was that coordinating across this kind of hierarchy wasted a bunch of time. If you already knew how to coordinate, mostly, you could instead get better performance

you might be surprised but we're getting to the point that communicating over a super computer is on the same order of magnitude as talking across a numa node.


I actually wasn't so much talking from that perspective, as simply from the perspective of the design of individual pieces. There were rather clever things done in e.g. older multipliers or adders or similar which, I think, could apply to most modern parallel architectures, be that GPGPU, SP, MPE, FPGA, or whatever, in order to significantly increase density at a cost of slightly reduced serial performance.

For machine learning, that's a good tradeoff.

Indeed, with some of the simpler architectures, I think computation could be moved into the memory itself, as long dreamed of.

(Simply sticking 32,000 SA-110 processors on a die would be very, very limited by interconnect; there's a good reason for the types of architectures we're seeing not being that)


Truth is that there is another startup called Graphcore that is doing exactly that, and also a really big chip


I assume no one will read this, but good places to look for super-clever ways to reduce transistor count while maintaining good performance:

- Early mainframes / room-sized computers (era of vacuum tubes and discrete transistors), especially at the upper end, where there was enough budget to have modern pipelined and scalar architectures.

- Cray X-MP and successors

- DEC Alpha / StrongARM (referenced SA-110)

Bad places to look are all the microcode architectures. These optimized transistor count, often sacrificing massive amounts of performance in order to save on cost. Ditto for some of the minicomputers, where the goal was to make an "affordable" computer. Something like the PDP was super-clever in cost-cutting, which made sense at the time, but did much less to maintain performance.

There's a ton of long-forgotten cleverness.


They do what you were talking about, not what I was.

They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."

There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.

The architecture is reminiscent of some ideas from circa-2000 which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; company was acquihired by EZchip for a fraction of the investment which was put into it, which went to Mellanox, and then to NVidia).


Sweet, thanks! It seems like this research ecosystem was incredibly rich, but Moore's law was in full swing, and statically known workloads weren't useful at the compute scale of back then.

So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is.. more fertile.


Lots of things were useful to compute.

The problem was

1) If you took 3 years longer to build a SIMD architecture than Intel to make a CPU, Intel would be 4x faster by the time you shipped.

2) If, as a customer, I was to code to your architecture, and it took me 3 more years to do that, by that point, Intel would be 16x faster

And any edge would be lost. The world was really fast-paced. Groq was founded in 2016. It's 2024. If it were still the heyday of Moore's Law, you'd be competing with CPUs running 40x as fast as today's.

I'm not sure you'd be so competitive against a 160GHz processor, and I'm not sure I'd be interested knowing a 300+GHz was just around the corner.
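The 4x / 16x / 40x figures above are consistent with performance doubling every 18 months. That doubling period is an assumption inferred from the numbers, not something stated in the comment, and the 4 GHz baseline for "today's" CPU is likewise assumed. A minimal sketch:

```python
def moore_speedup(years, doubling_period_years=1.5):
    """Speedup after `years` if performance doubles every doubling_period_years."""
    return 2 ** (years / doubling_period_years)


BASELINE_GHZ = 4.0  # assumed clock of a present-day CPU, for illustration only

print(moore_speedup(3))                 # 3 years behind Intel
print(moore_speedup(6))                 # plus 3 more years of porting effort
print(moore_speedup(8))                 # 2016 to 2024
print(BASELINE_GHZ * moore_speedup(8))  # hypothetical clock after 8 years
```

Under that assumption, 3 years gives 4x, 6 years gives 16x, and 8 years gives roughly 40x, which is where a hypothetical 160 GHz part comes from, with about 320 GHz one doubling later.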

Good ideas -- lots of them -- lived in academia, where people could prototype neat architectures on ancient processes, and benchmark themselves to CPUs of yesteryear from those processes.


Surely once you're scaling over multiple chips/servers/racks you're dealing with retries and checksums and sequence numbers anyway? How do you get around the non-determinacy of networking beyond just hoping that you don't see any errors?


Our interconnect between chips is also deterministic! You can read more about our interconnect, synchronisation, and error correction in our paper.

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...


Groq devices are really well set up for small-batch-size inference because of the use of SRAM.

I'm not so convinced they have a Tok/sec/$ advantage at all, though, especially at medium to large batch sizes, which is what the groups who can afford to buy so much silicon would be running.

I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.
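The usual reasoning behind the GPU half of that claim can be sketched with a simple roofline model: at small batch sizes each decode step is dominated by streaming the weights from memory, so adding sequences to the batch is nearly free until compute becomes the bottleneck. Every number below is an illustrative placeholder, not a measured or vendor figure, and the model ignores KV cache traffic and interconnect costs.

```python
def decode_step_seconds(batch, weight_bytes, mem_bw, flops_per_token, peak_flops):
    """Time for one decode step under a simple roofline model."""
    memory_time = weight_bytes / mem_bw                  # weights are read once per step
    compute_time = batch * flops_per_token / peak_flops  # math grows with batch size
    return max(memory_time, compute_time)


def tokens_per_second(batch, **hw):
    """Aggregate tokens/s across all sequences in the batch."""
    return batch / decode_step_seconds(batch, **hw)


# Illustrative placeholder numbers (assumptions, not vendor specs).
HW = dict(
    weight_bytes=140e9,     # 70B parameters at 2 bytes each
    mem_bw=4.8e12,          # bytes/s of off-chip memory bandwidth
    flops_per_token=140e9,  # about 2 FLOPs per parameter per token
    peak_flops=1e15,        # FLOP/s
)

for batch in (1, 100, 400):
    print(batch, round(tokens_per_second(batch, **HW)))
```

With these placeholders, throughput scales almost linearly up to a batch of about 200 and flattens beyond that. Whether the same curve applies to an SRAM-based design is exactly the open question in this subthread.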


(Groq Employee) It's hard to discuss Tok/sec/$ outside of the context of a hardware sales engagement.

This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!

As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.


    I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1

I guess if you don't have any extra junk you can pack more processing into the chip?


(Groq Employee) Yes! Determinism + Simplicity are superpowers for ALU and interconnect utilization rates. This system is powered by 14nm chips, and even the interconnects aren't best in class.

We're just that much better at squeezing tokens out of transistors and optic cables than GPUs are - and you can imagine the implications on Watt/Token.

Anyways.. wait until you see our 4nm. :)


I've been thinking the same but on the other hand, that would mean they are operating at a huge loss which doesn't scale


> more than a single model and a lot of finetunes/high rank LoRAs

I can imagine a way might be found to host a base model and a bunch of LoRA's whilst using barely more ram than the base model alone.

The fine-tuning could perhaps be done in such a way that only perhaps 0.1% of the weights are changed, and for every computation the difference is computed not over the weights, but of the output layer activations.
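A minimal sketch of that idea in the low-rank (LoRA) form: the base weight matrix W is shared and left untouched, and each adapter contributes a small correction computed on the activations, y = xW + (xA)B. The helper names and tiny matrices are purely illustrative, and plain Python lists are used to keep it dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w_base, lora_a, lora_b):
    """y = x @ W + (x @ A) @ B, with the shared base W never modified."""
    return add(matmul(x, w_base), matmul(matmul(x, lora_a), lora_b))


def lora_overhead(d_model, rank):
    """Adapter parameters relative to one d_model x d_model base matrix."""
    return (2 * d_model * rank) / (d_model * d_model)


x = [[1, 2]]
w = [[1, 0], [0, 1]]  # shared base weights
a = [[1], [1]]        # adapter down-projection (rank 1)
b = [[2, 3]]          # adapter up-projection

y_adapter = lora_forward(x, w, a, b)
y_merged = matmul(x, add(w, matmul(a, b)))  # same result with merged weights

print(y_adapter, y_merged)
print(lora_overhead(8192, 16))
```

Both paths give the same output, but the adapter path only stores A and B per fine-tune. For a hypothetical 8192-wide layer at rank 16 that is about 0.4% extra parameters per adapter.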


This actually already exists! We did a writeup of the relevant optimizations here: https://openpipe.ai/blog/s-lora


There's also papers for hosting full-parameter fine-tuned models: https://arxiv.org/abs/2312.05215

Disclaimer: I'm one of the authors.


I recall a recent discussion about a technique to load the diff in weights between a lora and base model, zip it and transfer it on a per-needs basis.


>The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a minuscule (230 MiB) [0] amount of ultra-fast SRAM [...]

IDGAF about any of that, lol. I just want an API endpoint.

480 tokens/sec at $0.27 per million tokens? Sign me in, I don't care about their hardware, at all.


there are providers out there offering for $0 per million tokens, that doesn't mean it is sustainable and won't disappear as soon as the VC well runs dry. Am not saying this is the case for Groq, but in general you probably should care if you want to build something serious on top of anything.


(Groq Employee) Agreed, one should care, and especially since this particular service is very differentiated by its speed and has no competitors.

That being said, until there's another option at anywhere that speed.. That point is moot, isn't it :)

For now, Groq is the only option that can let you build a UX with near-instant response times. Or live agents that help with a human-to-human interaction. I could go on and on about the product categories this opens.


Why go so fast? Aren't Nvidia's products fast enough from a TPS perspective?


OpenAI have a voice powered chat mode in their app and there's a noticeable delay of a few seconds between finishing your sentence and the bot starting to speak.

I think the problem is that for realistic TTS you need quite a few tokens because the prosody can be affected by tokens that come a fair bit further down the sentence, consider the difference in pitch between:

"The war will be long and bloody"

vs

"The war will be long and bloody?"

So to begin TTS you need quite a lot of tokens, which in turn means you have to digest the prompt and run a whole bunch of forward passes before you can start rendering. And of course you have to keep up with the speed of regular speech, which OpenAI sometimes struggles with.
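The latency budget described here can be sketched as prompt digestion plus however many tokens the TTS stage needs to look ahead. The prompt length, prefill rate, lookahead and the 30 tokens/s comparison point are all made-up assumptions; only the 480 tokens/s figure is the one quoted upthread.

```python
def time_to_first_audio(prompt_tokens, prefill_tps, lookahead_tokens, decode_tps,
                        tts_startup_s=0.0):
    """Seconds from end of user speech until TTS can start rendering.

    prefill: time to digest the prompt
    decode:  time to generate enough tokens for the TTS to fix its prosody
    """
    prefill = prompt_tokens / prefill_tps
    decode = lookahead_tokens / decode_tps
    return prefill + decode + tts_startup_s


# Hypothetical workload: 500-token prompt, 20 tokens of lookahead.
slow = time_to_first_audio(500, prefill_tps=5000, lookahead_tokens=20, decode_tps=30)
fast = time_to_first_audio(500, prefill_tps=5000, lookahead_tokens=20, decode_tps=480)

print(round(slow, 3), round(fast, 3))
```

With these assumed numbers the decode term dominates at 30 tokens/s (about 0.77 s in total) and nearly vanishes at 480 tokens/s (about 0.14 s), leaving prefill as the main cost.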

That said, the gap isn't huge. Many apps won't need it. Some use cases where low latency might matter:

- Phone support.

- Trading. Think digesting a press release into an action a few seconds faster than your competitors.

- Agents that listen in to conversations and "butt in" when they have something useful to say.

- RPGs where you can talk to NPCs in realtime.

- Real-time analysis of whatever's on screen on your computing device.

- Auto-completion.

- Using AI as a general command prompt. Think AI bash.

Undoubtedly there will be a lot more though. When you give people performance, they find ways to use it.


You've got good ideas. What I like to personally say is that Groq makes the "Copilot" metaphor real. A copilot is supposed to be fast enough to keep up with reality and react live :)


Hi foundval, can we connect on Linkedin please? :


I honestly don't see the problem.

"just to serve a single model" could be easily fixed by adding a single LPDDR4 channel per LPU. Then you can reload the model sixty times per second and serve 60 different models per second.


per-chip compute is not the main thing this chip innovates for fast inference, it is the extremely fast memory bandwidth. when you do that, you'll lose all of that and will be much worse off than any off the shelf accelerators.


load model, compute a 1k token response (ie, do a thousand forward passes in sequence, one per token), load a different model, compute a response,

I would expect the model loading to take basically zero percent of the time in the above workflow
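A rough sketch of that workflow, assuming each LPU refills its own 230 MiB of SRAM in parallel over a dedicated link. The 12.8 GB/s link bandwidth is a hypothetical figure chosen for illustration, and 480 tokens/s is the generation speed quoted upthread. This only models reload-then-generate; it says nothing about streaming weights from DRAM during compute.

```python
SRAM_BYTES = 230 * 2**20  # per-LPU SRAM, as quoted at the top of the thread


def reload_seconds(model_bytes, load_bw_bytes_per_s):
    """Time to refill one chip's SRAM over its own memory channel."""
    return model_bytes / load_bw_bytes_per_s


def load_fraction(model_bytes, load_bw, response_tokens, decode_tps):
    """Share of wall-clock time spent loading in a load-then-generate cycle."""
    load = reload_seconds(model_bytes, load_bw)
    generate = response_tokens / decode_tps
    return load / (load + generate)


LINK_BW = 12.8e9  # bytes/s, assumed per-chip DRAM bandwidth

reload_s = reload_seconds(SRAM_BYTES, LINK_BW)
fraction = load_fraction(SRAM_BYTES, LINK_BW, response_tokens=1000, decode_tps=480)

print(reload_s, 1 / reload_s)  # seconds per reload, reloads per second
print(fraction)                # fraction of time spent loading
```

Under these assumptions a reload takes about 19 ms (roughly 50 per second), and loading accounts for under 1% of a cycle that generates a 1k-token response.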



