
Keen to hear more about this benchmark. Is it representative of chat-to-document style use cases with big docs?


Looks like it's this benchmark [1]. It's certainly less artificial than most long-context benchmarks (which are basically just big lookup tables), but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (typically excluded from training sets because they are basically porn).
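To illustrate the "big lookup table" criticism: a typical needle-in-a-haystack long-context eval just buries one fact in a mass of filler and asks for it back, so the task reduces to retrieval rather than comprehension. A minimal sketch of how such a test is constructed (all names, values, and sentence text here are hypothetical, not from any actual benchmark):

```python
import random

def make_needle_haystack(n_filler: int = 5000, seed: int = 0) -> tuple[str, str]:
    """Build a toy 'lookup table' long-context test: bury one key-value
    fact (the needle) in a long run of filler sentences (the haystack),
    then ask the model to retrieve it."""
    rng = random.Random(seed)
    needle = "The magic number for project X is 41729."
    filler = ["The sky was a pleasant shade of blue that afternoon."] * n_filler
    filler.insert(rng.randrange(len(filler)), needle)  # hide the needle
    prompt = " ".join(filler) + "\n\nQuestion: What is the magic number for project X?"
    return prompt, "41729"

prompt, answer = make_needle_haystack()
# A trivial grader: exact substring search already solves it, which is
# why these benchmarks test search more than they test understanding.
assert answer in prompt
```

Fanfiction-based questions like Fiction.LiveBench's are harder to game this way, since answering requires tracking plot and characters rather than matching one string.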

[1] https://arxiv.org/pdf/2409.12640

[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...


Update: Gemini 2.5 also crushes Fiction.LiveBench


"MRCR (multi-round coreference resolution)", for those looking for the link to Michelangelo



