Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).