
Keen to hear more about this benchmark. Is it representative of chat-to-document style use cases with big docs?


Looks like it's this benchmark [1]. It's certainly less artificial than most long-context benchmarks (which are basically just big lookup tables), but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (typically excluded from training sets because they are basically porn).
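To illustrate the "big lookup table" criticism: a typical needle-in-a-haystack long-context eval just buries one fact in a mass of filler and asks for it back, so the task reduces to retrieval rather than comprehension. A minimal sketch of how such a test is constructed (all names, values, and sentence text here are hypothetical, not from any actual benchmark):

```python
import random

def make_needle_haystack(n_filler: int = 5000, seed: int = 0) -> tuple[str, str]:
    """Build a toy 'lookup table' long-context test: bury one key-value
    fact (the needle) in a long run of filler sentences (the haystack),
    then ask the model to retrieve it."""
    rng = random.Random(seed)
    needle = "The magic number for project X is 41729."
    filler = ["The sky was a pleasant shade of blue that afternoon."] * n_filler
    filler.insert(rng.randrange(len(filler)), needle)  # hide the needle
    prompt = " ".join(filler) + "\n\nQuestion: What is the magic number for project X?"
    return prompt, "41729"

prompt, answer = make_needle_haystack()
# A trivial grader: exact substring search already solves it, which is
# why these benchmarks test search more than they test understanding.
assert answer in prompt
```

Fanfiction-based questions like Fiction.LiveBench's are harder to game this way, since answering requires tracking plot and characters rather than matching one string.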

[1] https://arxiv.org/pdf/2409.12640

[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...


Update: Gemini 2.5 also crushes Fiction.LiveBench


"MRCR (multi-round coreference resolution)", for those looking for the link to Michelangelo



