I hope this dataset is used to probe wrong answers much more than to try to get every answer correct. I don't need LLMs to know everything, but I do need them to know what they don't know. That's not a capability that comes naturally to a probabilistic sampling of tokens.
The ideal 'progress' on this benchmark is for a model to remove incorrect answers and replace them with "I don't know" answers. Even if it hurts the correct answer count a bit, I'd gladly make that tradeoff for a model that hallucinated far less often.
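To make the tradeoff concrete, here's a minimal sketch of a scoring rule that rewards abstention over guessing. The penalty value is my own illustration, not SimpleQA's actual grading: score +1 for a correct answer, -p for a wrong one, 0 for "I don't know", so guessing only pays off when confidence exceeds p / (1 + p).

    def expected_score(confidence: float, wrong_penalty: float) -> float:
        """Expected score from answering, vs. the 0 earned by abstaining."""
        return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

    # With a penalty of 2, the break-even confidence is 2/3:
    print(expected_score(0.6, 2.0))  # -0.2 -> "I don't know" is rational
    print(expected_score(0.7, 2.0))  #  0.1 -> answering is rational

Under a rule like this, a model that swaps hallucinations for "I don't know" climbs the leaderboard even if its correct-answer count dips.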
>That's not a capability that comes naturally to a probabilistic sampling of tokens.
The linked page from OpenAI clearly shows the opposite; read the "Using SimpleQA to measure the calibration of large language models" section and the linked paper: https://arxiv.org/abs/2207.05221
The probabilistic sampling of tokens does not naturally produce an introspective evaluation of confidence; under greedy sampling it simply selects the highest-probability token. The paper you linked demonstrates that, if a separate evaluation phase is allowed, a model can decide with some accuracy whether its previous statements were true. That is not the behavior we want out of a system, as it involves:

1. production of potentially misleading output,
2. identification of the factual statements within that output,
3. classification of each of those statements as true or false, and
4. restatement of the original output without the factual errors.

The research area I am advocating for would aim to prevent (1), not mask it.
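For concreteness, here's a rough sketch of the generate-then-verify loop that steps 1-4 describe. The `ask` callable and the prompt wording are hypothetical stand-ins, loosely following the self-evaluation setup in the linked paper:

    from typing import Callable

    def answer_with_verification(question: str, ask: Callable[[str], str]) -> str:
        # Step 1: produce output that may contain errors.
        draft = ask(question)
        # Steps 2-3: have the model identify and classify its own claim.
        verdict = ask(
            f"Question: {question}\nProposed answer: {draft}\n"
            "Is the proposed answer true? Answer True or False."
        )
        # Step 4: restate the draft, or abstain, based on the self-evaluation.
        return draft if verdict.strip().startswith("True") else "I don't know"

The point is that every step happens after the potentially misleading text already exists; nothing in the sampling loop itself prevents it from being produced.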