"The term 'hallucination' does anthropomorphize LLMs"
It does not, as hallucinations are not something only humans experience. Beyond that, it is now an accepted term of art used to describe a specific behavior exhibited by an LLM, one that is separate from the biological phenomenon.
The problem I have with the term is that we already have a word that describes much more accurately what these models are doing: guessing. Guessing is simply reporting information one does not know to be true. When a model lacks data points about a topic, each token it returns comes with lower and lower confidence. It's literally guessing. But since we aren't exposed to the confidence scores of the completion, it reads as full confidence when that is not the case.
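To make that concrete, here's a minimal sketch of what surfacing those per-token confidences can look like, using the Hugging Face transformers library with GPT-2 as a stand-in. The model, prompt, and library are illustrative assumptions on my part; hosted LLMs expose (or hide) this differently.

    # Illustrative only: print the probability the model assigned to each
    # token it generated. GPT-2 stands in for "an LLM" here.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The first person to walk on the moon was"
    inputs = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,               # greedy decoding
            return_dict_in_generate=True,
            output_scores=True,            # keep the per-step logits
        )

    prompt_len = inputs["input_ids"].shape[1]
    for step, logits in enumerate(out.scores):
        probs = torch.softmax(logits[0], dim=-1)
        token_id = out.sequences[0, prompt_len + step]
        print(f"{tok.decode(token_id)!r}: p = {probs[token_id]:.3f}")

Greedy decoding still happily emits a token even when the probability assigned to it is low, and none of that uncertainty survives into the text the user sees.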
That framing fails to cover the case where the model is confident in a response (at the token level) and is wrong, which I think is still considered hallucinating.
Misconceptions. There's no inherent reason a false statement would have lower probability than a true one.
To be clear, I'm referring to things like GPT-3.5 reportedly consistently messing up on statements like "what's heavier, two pounds of feathers or a pound of bricks". Being consistently wrong in the same way implies to me (but I don't know for sure) that the class of response is high probability in an absolute sense.
I can't find the article that demonstrated the sort of things that GPT consistently gets wrong, but it was things like common misconceptions and sayings.
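For what it's worth, the "high probability in an absolute sense" part is something you can poke at directly with an open model. A rough sketch, with GPT-2 standing in purely for illustration (it says nothing about what GPT-3.5 actually does): score a misconception-shaped sentence and the correct one, and compare which the model finds more probable overall.

    # Illustrative only: compare the total log-probability a small open model
    # assigns to a misconception-shaped sentence vs. the correct one.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def total_logprob(text):
        ids = tok(text, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            # With labels=ids the model returns the mean cross-entropy over
            # the predicted positions; multiply back out for a total log-prob.
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.shape[1] - 1)

    sentences = [
        "A pound of bricks is heavier than two pounds of feathers.",
        "Two pounds of feathers are heavier than a pound of bricks.",
    ]
    for s in sentences:
        print(f"{total_logprob(s):8.1f}  {s}")

If the familiar-but-wrong phrasing scores higher, that would at least be consistent with the "common misconception" class of response being genuinely high probability rather than a low-confidence guess.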
Very interesting. So it could produce, with high confidence, common real-world guesses found in its dataset.
So in that case it's not guessing, and in one sense not even wrong: it's producing the completion its training data makes "correct", yet that completion is still false. Now we're really getting into the weeds here, though.