Hacker News

In general, yes: benchmark pollution is a big problem, which is why only dynamic benchmarks matter.


This is true, but how would pollution work for a benchmark designed to test hallucinations?


A dataset of answers labelled as hallucinations or not hallucinations, based on the benchmark, gets published as part of a paper.

People _seriously_ underestimate just how much stuff is online and how much impact it can have on training.



