I hope this dataset is used to probe wrong answers much more than to try to get every answer correct. I don't need LLMs to know everything, but I do need them to know what they don't know. That's not a capability that comes naturally to a probabilistic sampling of tokens.
The ideal 'progress' on this benchmark is for a model to remove incorrect answers and replace them with "I don't know" answers. Even if it hurts the correct answer count a bit, I'd gladly make that tradeoff for a model that hallucinated far less often.
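To make the tradeoff concrete, here's a minimal sketch of a scoring rule that rewards abstention over guessing. The penalty value is my own illustration, not SimpleQA's actual grading: score +1 for a correct answer, -p for a wrong one, 0 for "I don't know", so guessing only pays off when confidence exceeds p / (1 + p).

    def expected_score(confidence: float, wrong_penalty: float) -> float:
        """Expected score from answering, vs. the 0 earned by abstaining."""
        return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

    # With a penalty of 2, the break-even confidence is 2/3:
    print(expected_score(0.6, 2.0))  # -0.2 -> "I don't know" is rational
    print(expected_score(0.7, 2.0))  #  0.1 -> answering is rational

Under a rule like this, a model that swaps hallucinations for "I don't know" climbs the leaderboard even if its correct-answer count dips.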
>That's not a capability that comes naturally to a probabilistic sampling of tokens.
The linked page from OpenAI clearly shows the opposite; read the "Using SimpleQA to measure the calibration of large language models" section and the linked paper: https://arxiv.org/abs/2207.05221
The probabilistic sampling of tokens does not naturally produce an introspective evaluation of confidence; under greedy sampling it simply selects the highest-probability token. The paper you linked demonstrates that, if a separate evaluation phase is allowed, a model can decide with some accuracy whether its previous statements were true. That is not the behavior we want out of a system, as it involves:

1. production of potentially misleading output,
2. identification of the factual statements within that output,
3. classification of each of those statements as true or false, and
4. restatement of the original output without the factual errors.

The research area I am advocating for would aim to prevent (1), not mask it.
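For concreteness, here's a rough sketch of the generate-then-verify loop that steps 1-4 describe. The `ask` callable and the prompt wording are hypothetical stand-ins, loosely following the self-evaluation setup in the linked paper:

    from typing import Callable

    def answer_with_verification(question: str, ask: Callable[[str], str]) -> str:
        # Step 1: produce output that may contain errors.
        draft = ask(question)
        # Steps 2-3: have the model identify and classify its own claim.
        verdict = ask(
            f"Question: {question}\nProposed answer: {draft}\n"
            "Is the proposed answer true? Answer True or False."
        )
        # Step 4: restate the draft, or abstain, based on the self-evaluation.
        return draft if verdict.strip().startswith("True") else "I don't know"

The point is that every step happens after the potentially misleading text already exists; nothing in the sampling loop itself prevents it from being produced.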