https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87 IMO...

jbentley1 on April 14, 2025 \| on: GPT-4.1 in the API

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

IMO this is the best long-context benchmark. Hopefully they will run it on the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle-in-a-haystack results but horrible real-world performance.