
Super interesting that they chose 671B and 7B, and nothing like 32B, which feels like a "sweet spot".


Likely because they haven't got their own suitable SoTA base models of any other size to build on. DeepSeek V3 is 671B, and DeepSeek-Prover-V1.5 [1] is only 7B, built on DeepSeekMath (7B), which in turn is based on DeepSeekCoder-Base-7B-v1.5. Maybe DeepSeek-Coder-V2 (16B and 236B) would be a good starting point, but it was merged into DeepSeek V2.5, and V2.5 is inferior to V3. Or some version of Qwen.

[1] https://github.com/deepseek-ai/DeepSeek-Prover-V1.5


Also notable: the earliest planning for a well-received release of a new model might include market segmentation both by parameter count and by skill type.

--> "In an increasingly crowded field of LLMs, how will our (costly to produce) model stand out?"


I feel like this is a very logical way to do things: test the hypothesis on a small model, play around, get it working, then apply the findings to the big model.


Or they did and it wasn't sweet? (No idea, but it seems they would try that before writing up a publication.)



