
Even with search grounding, it scored a 2.5/5 on a basic botanical benchmark. It would take much longer for the average human to do a similar write-up, but they would likely do better than 50% hallucination if they had access to a search engine.


Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.


Training for specific tasks still works pretty well, but “vision” is a super broad domain, and most models seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation).



