
SOTA models are reportedly MoE, not dense.


A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.


True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel, since each stage only has to read its own slice of the layers. That probably makes offload somewhat more effective. RAM caching is also a natural option.
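To put rough numbers on that, here is a back-of-envelope sketch of the read-time floor per decoded token. Everything in it is an assumption for illustration: the function name, the 220B active-parameter figure, 1 byte per parameter, and 5 GB/s per SSD are all made up, and it assumes every SSD reads concurrently (i.e. each pipeline stage prefetches its slice), which real decode may not achieve.

```python
def stream_seconds_per_token(active_params, bytes_per_param, ssd_gb_per_s, num_ssds=1):
    """Lower bound on the time to read one token's active weights from SSD.

    Assumes perfect striping: every SSD reads its share concurrently,
    with no compute or interconnect overhead (an idealization).
    """
    bytes_needed = active_params * bytes_per_param
    aggregate_bytes_per_s = ssd_gb_per_s * 1e9 * num_ssds
    return bytes_needed / aggregate_bytes_per_s


# Hypothetical numbers, not measurements:
# 220B active params, 8-bit weights, 5 GB/s per SSD.
single = stream_seconds_per_token(220e9, 1, 5, num_ssds=1)
striped = stream_seconds_per_token(220e9, 1, 5, num_ssds=8)
print(single, striped)
```

Under those assumptions, the floor scales inversely with the number of SSDs, so parallel streaming helps, but the per-token cost stays in the seconds range unless the aggregate bandwidth is much higher.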


You won't be RAM caching much of anything when the experts are 220B parameters' worth of layers.
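A quick way to sanity-check that is to count how many whole experts fit in RAM. This is only a sketch: the helper name, the 512 GB RAM figure, and the bytes-per-parameter values are assumptions, and it ignores the KV cache, activations, and OS overhead, all of which would shrink the budget further.

```python
def experts_in_ram(ram_bytes, expert_params, bytes_per_param):
    """Number of whole experts that fit in RAM, ignoring all other memory use."""
    expert_bytes = expert_params * bytes_per_param
    return int(ram_bytes // expert_bytes)


# Hypothetical machine with 512 GB of RAM and 220B-parameter experts.
print(experts_in_ram(512e9, 220e9, 1))  # 8-bit weights
print(experts_in_ram(512e9, 220e9, 2))  # 16-bit weights
```

With only a couple of experts resident at best, most routing decisions would still miss the cache and fall through to SSD.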



