Fixing hallucinations at the source will be a tough one. The root of the issue is that the loss doesn't discriminate. A probable guess will lower loss far more than "I don't know" or whatever equivalent. Educated guessing becomes an essential skill the model learns during initial training.
So the objective function encourages it. But the dataset encourages it as well. There will be many, many sentences that can't be completed accurately to the source even with all the knowledge and understanding in the world. Many completions will have numerous sensible options. The dataset doesn't discriminate. Fiction, Fact, Opinion, Mistake. All the same. All given equal weight.
> A probable guess will lower loss far more than "I don't know" or whatever equivalent.
Guessing only reduces loss as much as the dataset allows -- a bad guess will give a higher loss. The model learns to assign probabilities to its guesses, just like we do. It seems to me all we need here is a measure of confidence for the result averaged over the entire answer. Low confidence is a guess/hallucination.
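A rough sketch of what I mean, with made-up per-token probabilities (a real system would read these off the model's output, ideally working in log space):

    import math

    def answer_confidence(token_probs):
        # Geometric mean of per-token probabilities (i.e. average log-prob),
        # so longer answers aren't automatically penalized.
        if not token_probs:
            return 0.0
        avg_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(avg_logprob)

    # Hypothetical per-token probabilities for two generated answers.
    recalled_fact = [0.95, 0.90, 0.92, 0.88, 0.93]
    wild_guess    = [0.40, 0.35, 0.20, 0.30, 0.25]

    print(answer_confidence(recalled_fact))  # ~0.92 -- report as confident
    print(answer_confidence(wild_guess))     # ~0.29 -- flag as a guess/hallucination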
> But the dataset encourages it as well. There will be many, many sentences that can't be completed accurately to the source even with all the knowledge and understanding in the world. Many completions will have numerous sensible options. The dataset doesn't discriminate. Fiction, Fact, Opinion, Mistake. All the same. All given equal weight.
This is an important issue but should be tackled as a distinctly different problem, I think: it's the weighty concept of truth that humanity has struggled with since day one. Indeed, how do we discriminate? LLMs won't ever solve this via completions or the dataset alone; instead, successful models will use slow, step-by-step reasoning involving logical principles and rational heuristics in prompt space. Pretty much like we do.
My knowledge in this area is very limited, but based on the high level descriptions I've seen of how LLMs work (including the OP), it seems like it would be fairly trivial to output, along with each response, a "confidence factor" of some sort for that response. While that might cause confusion for some users, it could be incredibly valuable to differentiate between confident responses and guesses, as you say.
The continuation of the phrase “George Washington was born” could be multiple things. You get a probability for the next token selected (for example “in”) and a probability for the token after that (for example “Virginia”), and you can multiply them to get the probability of the “in Virginia” response, but what does it mean? Maybe the probability is low because “on February …” is more likely.
If the first token was “in” you could end up with “in Virginia in 1732” or “in 1732 in Virginia” and both responses are in some sense the same but the probability of each one doesn’t take that into account. Et cetera.
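To put numbers on it (all made up): the probability of a whole response is the product of its per-token probabilities, and two wordings of the same fact each carry only part of the probability mass.

    from math import prod

    # Hypothetical per-token probabilities for two phrasings of the same fact.
    phrasing_a = [0.50, 0.40, 0.60, 0.70]  # "in Virginia in 1732"
    phrasing_b = [0.50, 0.30, 0.55, 0.80]  # "in 1732 in Virginia"

    p_a = prod(phrasing_a)           # ~0.084
    p_b = prod(phrasing_b)           # ~0.066
    print(p_a, p_b, p_a + p_b)       # each wording alone looks unsure,
                                     # but together they carry ~0.15 of the mass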
Yeah, I saw something similar in a reply to another comment. I don't think it would be quite as bad as that because it's not just completing the phrase in a vacuum though, but in the context of the prompt. So if the prompt was "where was GW born", then "in Virginia" would be much more likely than "in 1732". But I do understand that there would often be multiple ways to word the same thing, or multiple correct answers to the same prompt.
In the case of multiple wordings of the same thing, I wonder if there could be a way to determine closeness of responses, and consider them together when calculating confidence. As a simple example, if responses have the same rare words (like 1732) and differ only in sentence order or the more common words ("in", etc.) used, those would be more similar than ones that used different rare words. So perhaps that could be accounted for (rough sketch below).
As for multiple correct answers to the same prompt, I think that's fine. The confidence of a response might be low because it's one correct answer of many, or because the model has no idea and it's taking a wild-ass guess. But the user asking the question probably has an idea of whether what's being asked is very common knowledge or something obscure or controversial. At least much of the time. And even if the metric wasn't perfect, I still feel it could be useful.
Of course this is all the rambling of someone who doesn't really know anything about this stuff. You could just say I'm spitting out some likely tokens I guess; consider the confidence low.
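That said, here's a rough sketch of the grouping idea (the stopword list, answers, and probabilities are all made up): reduce each sampled answer to its rare content words, then sum the probability mass within each group.

    from collections import defaultdict

    STOPWORDS = {"in", "on", "the", "was", "born", "of", "a"}

    def content_key(answer):
        # Keep only the rare/content words so reorderings and filler-word
        # changes map to the same key.
        return frozenset(w.lower().strip(".,") for w in answer.split()
                         if w.lower() not in STOPWORDS)

    # Hypothetical sampled answers with their sequence probabilities.
    samples = [
        ("born in Virginia in 1732", 0.084),
        ("born in 1732 in Virginia", 0.066),
        ("born in Westmoreland County", 0.020),
    ]

    grouped = defaultdict(float)
    for text, p in samples:
        grouped[content_key(text)] += p

    # Confidence in the underlying fact = mass of the dominant group.
    print(max(grouped.values()))  # ~0.15, higher than any single wording alone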
You’re right, there are ways to tackle this problem but they may require some case-by-case effort to define what you are trying to find out and to incorporate information external to the model itself. Not fairly trivial :-)
Ha, I mean it would be fairly trivial to output "a confidence factor of some sort". It just becomes less trivial when you try to actually make it useful!
So you take the output, e.g. "George Washington was born in Virginia", and ask another prompt: Is the following true? Answer with a single word, either true or false: "George Washington was born in Virginia". It will then output true/false with a probability, although for GPT-4 this is not available through the API.
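Something like this, where 'get_next_token_probs' is a stand-in for whatever call (a local model, or an endpoint that exposes logprobs) returns the next-token distribution -- as noted, GPT-4's API doesn't expose it:

    def verify_claim(claim, get_next_token_probs):
        # Ask the model to judge the statement, then read off how much
        # probability it puts on "true" vs "false" for the first answer token.
        prompt = ('Is the following true? Answer with a single word, '
                  f'either true or false: "{claim}"')
        probs = get_next_token_probs(prompt)
        p_true = probs.get("true", 0.0) + probs.get("True", 0.0)
        p_false = probs.get("false", 0.0) + probs.get("False", 0.0)
        total = p_true + p_false
        return p_true / total if total > 0 else 0.5

    # e.g. verify_claim("George Washington was born in Virginia", my_probs_fn)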
Actually it's funny how you can ask the follow-up question "Are you sure?" and quite often GPT-4 will apologize and change a correct answer to give an incorrect one instead.
That's weird. Having the community study this would certainly help them. Are they afraid it gives too much insight into their proprietary training/modeling methods?
Token probabilities used to be really useful for detecting text written with the same model, as such text scored as high probability... unfortunately the probabilities are messed up by RLHF.
The problem is that the models are already evaluating confidence on their answers and picking the best one... And that confidence is based on token generation....
Imagine the question "In which year was Donald Trump born?"
The LLM would start the answer by either:
"Donald Trump was born in ..."
Or
"I'm sorry I don't know"
And for the vast majority of answers the first option looks more "probable", so it starts producing tokens with an affirmative answer, and if the model eventually sees a bunch of low probability answers when it tries to produce the year, it's already "too late" to backtrack in a naive GPT implementation.
You could train LLM such that it responds with "I'm sorry I don't know" more often, but how do you predicate the response on "do this only if your 500B parameters don't encode the answer"? It requires self-referential logic on the model which isn't obvious to me how it would be done.
Maybe some smart people have figured this out, but I can see how this makes it really hard to do.
My understanding is that backtracking isn't needed: sampling the network one token at a time gives you the expected distribution over token sequences too --
E.g. if you were to brute-force expand out to the depth of "I'm sorry I don't know" and evaluate its probability relative to all other strings, you'd find that it is the same as what you got sampling one symbol at a time (though this isn't true if you do anything funny with your sampling).
The problem is actually that the distribution isn't the one you want, as it doesn't say "I don't know" often enough. It's easy enough to graft on a beam search: just expand out every possibility, keep the best N, and keep expanding them (sketch below). But AFAIK it doesn't help.
Maybe this is less true for models which have been through RLHF, though.
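For what it's worth, the "graft on a beam search" part looks roughly like this, where 'next_token_probs' is a stand-in for whatever gives you the model's next-token distribution:

    import math

    def beam_search(next_token_probs, beam_width=3, max_len=20, eos="<eos>"):
        # next_token_probs(prefix) is a stand-in returning {token: prob}
        # for the next token given the tokens generated so far.
        beams = [([], 0.0)]  # (token sequence, accumulated log-probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:
                    candidates.append((seq, score))  # finished beams carry over
                    continue
                for tok, p in next_token_probs(seq).items():
                    candidates.append((seq + [tok], score + math.log(p)))
            # Keep only the best N partial sequences and keep expanding them.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams  # highest-scoring sequences first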
Seems kinda tricky to train the right behavior here. Even if the input data contained "I don't know" (surely the internet doesn't, it's full of all us fking know-it-alls), it would contain "I don't know"s relative to the writer and not the model. So trying to push for it naively, you just end up with models that say they don't know but when you ask them the same question in ROT13 they answer correctly. :P
Seems tricky for humans to learn too. Small children are fluent in English long before they're fluent in giving truthful responses. :)
I don't think this is the problem. The confidence of the best answer won't always be the same. Sometimes there would be one answer that's significantly better than others, whereas other times there could be a lot of mediocre answers it's picking between. So having it spit out the confidence along with the answer could theoretically be useful.
What would be a challenge is what others noted in reply, that sometimes there would be multiple good answers, so low confidence wouldn't necessarily be a sign of a poor answer. (Though I expect work could be done there.)
The reason humans tend to tell the truth is that if we don't, other humans will call us out for it.
I wonder if there’s a way to mimic this “bs penalty” for GPT. Maybe you could have a setup where GPT gives an answer, then a second GPT has to guess whether a human would know if that answer is true or not.
> It seems to me all we need here is a measure of confidence for the result averaged over the entire answer. Low confidence is a guess/hallucination.
Even if the model knows the exact answer to the question, there may be many distinct ways of phrasing the answer. This would also lead to low confidence in any particular phrasing.
That should be okay though: 10 good answers will still report the score of the best one chosen. I think the GPTs use beam search, which projects out a "beam" (looks more like a tree to me) of probable answers, each of which has a score of accumulated token probabilities, and then just picks the highest.
In this case, it doesn't matter how wide the beam is or how many possible answers there are; the score is still the accumulated token probabilities of the best branch.
However, others have noted in the thread that RLHF might hurt this approach severely by scoring polite responses highly regardless of false answers (for example). Then you have to access the model pre-RLHF to get any idea of its true likelihood.
Ah, interesting, that does begin to explain how this might be more difficult than it initially appears. Could there be some way to define... proximity of different possible responses, and sum the confidence for all the nearby possibilities?
The datasets are massive, and curating them into buckets of fact/fiction by hand would be next to impossible. Automating curation of the datasets with the latest and greatest GPT sounds very possible however. GPT-supervised learning could be the key to bootstrapping robust models, and I would be shocked if OpenAI isn't using GPT4 to filter their datasets as we speak.
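A rough sketch of what that filtering pass could look like (the prompt, labels, and 'classify' call are all placeholders for whatever model or API you'd actually use):

    LABELS = {"fact", "fiction", "opinion", "mistake"}

    PROMPT = ("Classify the following passage as exactly one of: fact, fiction, "
              "opinion, mistake. Reply with the single label only.\n\n{doc}")

    def bucket_dataset(documents, classify):
        # 'classify' stands in for a call to an existing LLM.
        buckets = {label: [] for label in LABELS | {"unknown"}}
        for doc in documents:
            label = classify(PROMPT.format(doc=doc)).strip().lower()
            buckets[label if label in LABELS else "unknown"].append(doc)
        return buckets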
"Automating curation of the datasets with the latest and greatest GPT sounds very possible however."
It might also lead to recursive garbage generation. If the AI has flaws and those flaws are from the initial training, then I see no way the flawed data can ever generate clean data.
"The datasets are massive, and curating them into buckets of fact/fiction by hand would be next to impossible. "
And it is not impossible. It is just a lot of work. And maybe work we just have to do, if we want to create reliable AIs that do not fail at random times.
Now this alone might not completely remove hallucinations, but solid training data is just the base of it all.
(and since we are living in the era of fake news, I welcome all efforts towards established facts and data out of general principle)
Funny enough, in 2023, the option most likely to be viable for the impossible task of categorizing the training set would be to use an LLM. While I don't "trust" these AIs to give accurate information, it's probably within their capabilities to categorize by the above-mentioned categories... then feed that back into another (very expensive) round of training, along with some theoretical developments to boot. I do think this is within the realm of possibility in the next ~1 year, but it would be hard.
Oh, I surely think LLMs could help with the task of curation.
Maybe they could even spot lots of potential errors and flaws by themselves, to get to the worst cases in the dataset faster. But to finally confirm or negate the actual data in question, there has to be at least one (not overworked) human in the loop (and many eyes would be better). Otherwise it will just reinforce the existing flaws.
I don’t think it’s just a matter of curating data sets for accuracy. The model seems to be able to invent falsehoods on its own based on no data whatsoever - e.g. I read that researchers asked an LLM about a made-up biochemist and the LLM generated an entire fictional history for them.
An LLM has no concept of truth or falsehood, or facts, or logic, or anything else that matters for users. All it does is produce text that closely resembles (statistically speaking) the material it was trained on.
Training the model on "accurate" data isn't going to change the fact that you're operating at a completely different level of abstraction from the model -- one that it can't operate at, because that's simply not how it works.
This is the same problem that image models have. A human artist works with shapes, values, textures, colors, etc. An image model works with pixels, with zero higher-level abstractions or reasoning informing its output.
I see no reason why the image model couldn't work in a space of shapes and textures which are then mapped to pixels after the fact. Or even just leave its output in a raw vector based format. You could pick a different basis to work in.
Though I agree there is something ontological missing. All of these are flat bases. There's nothing spanning the recursive dimension. It can't draw a picture within a picture within a picture... to some specified depth n because it does not work with abstractions of "objects". The joke of René Magritte's "The Treachery of Images" is lost on the AI.
My hope would be that by labeling factual and fictional data it would coax the model into a state where it doesn’t invent facts outside of a creative writing context. Inventing falsehoods is practically the definition of creative writing, you don’t want the model to lose that ability.
This is purely hypothetical, but I imagine that the internal mechanisms the model uses to invent stories out of thin air are different from the mechanisms the model uses to recall precise facts. Providing additional labels could help sharpen the division between creative and factual writing tasks.
The model can't "invent facts" because it doesn't know what facts are. It's a statistical model that encodes information about the text it was trained on. It has no higher level abstractions to inform its output.
> I imagine that the internal mechanisms the model uses to invent stories out of thin air are different from the mechanisms the model uses to recall precise facts.
Nope! If you input "What is the capital of the United States of America?", it's probably going to output the correct answer, because the question (and answer) probably appeared many times in the training data. If you input "What is the capital of the Gronk Republic?", whatever output you get is generated via the exact same mechanism.
That’s the “LLMs are a fuzzy jpeg of the web” theory, and it’s far, far from the scientific consensus.
No one knows for sure exactly what happens inside a 500B parameter model. I’ll just leave you with GPT4’s response to your question “What is the capital of the Gronk Republic”.
“As an AI language model, I am not aware of any existing country or political entity called the "Gronk Republic." It could be a fictional or hypothetical place, in which case the capital would be determined by the creator or context in which it is mentioned. If you are referring to a real location, please provide more information or clarify the name of the place you are asking about.”
Here's the thing, though. Gronk is Robert James Gronkowski, a football player and celebrity who played mainly for the New England Patriots.
For that reason, I'd say the 'intelligent' answer would be a clever play on the notion that this celebrity was the ruler of a 'republic' of some sort, and associating some kind of location or thing to serve as the 'capital' of either a physical or conceptual 'republic', playing along with the gag.
So, you'd get 'Boston'.
Or you'd get 'Gillette Stadium… the END ZONE'.
(since I had to google all that, first answer is the biggest city representing the New England Patriots, and the second abstracts the 'republic' to be 'the area Gronk rules', depicting him as returning to his property by scoring points in football)
The thing is, this sort of wild associativeness IS 'intelligence' and it's simultaneously complete hallucination. It's valuable. Hallucination is potentially the most valuable thing an AI can do. Without it, you're doing nothing but regurgitating the work of others.
I work day in and day out at a job that demands I exploit associativeness, and that makes it not easy for AI to simply step in and replace me, but by the same token I can SPOT when it's moving in that direction, and I can exploit the abilities of AI (such as Stable Diffusion) to hallucinate and make wild associations, because I know the contexts in which that is useful. To handicap this is a bad mistake. You're asking AI to do the wrong things when you're asking it to be free of error. If it's perfectly free of error it's not intelligence anymore, it's a dataset.
It took me a minute and a little googling to track down who Gronk was, what he did, the rules of football, and why 'Gillette Stadium, the END ZONE' is actually a fantastic and intelligent answer to the question. It's a creative 'slipping' of the grounds of the question, in the absence of a literal answer, to provide a satisfying figurative answer that reframes the question in an unexpected way. When AI is able to do this, that will itself be a useful kind of intelligence… and we are already slightly there, without realizing it.
I wonder if GPT4 “traps” such questions and handles them using non-LLM algorithms. I mean Google already does it for that type of question, e.g. “capital of the irish republic” will return Dublin, Ireland. “Gronk” returns result about some (famous?) guy nicknamed Gronk.
Have you tried “arguing” with it and try to gaslight it by insisting that it’s non-fictional?
Over time I realized that the strength of GPT is not really in fact-finding, because there will always be a limitation with its training dataset; its strength is literally in language. One of the most useful things I've gotten ChatGPT to do is build a friend's resume for going into tech: being able to write a very unprofessional sentence like "did some react, css and js coding", just tell it to "make it sound more professional with metrics", and have it spit out a perfect bulleted list in the exact structure that every article on resumes of the past 10 years told you to use.
Writing marketing copy is another place I really found it useful, almost as if I was sitting with a marketing professional telling them "I want to say this: ...." and they return a perfect piece of marketing copy.
I don't think the solution to GPT is to "fix" the hallucinations, rather it would be to educate users on what is and isn't possible with it. I think a similar thing happened with Siri when it first came out, people thought it was magic at first but very quickly we all learned what it's good (and not good) at.
It would not be hard to generate a lot of false statements from structured data. Like if I know Michael Jackson was born on Aug 29, I can generate “Michael Jackson was born on July 5”. And you could also pair them with true examples which have similar characteristics. Can we use examples like this in the training process to teach the model not to hallucinate?
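Something like this, starting from a small structured record (the alternative dates are deliberately wrong, to serve as negative examples):

    import random

    FACTS = [
        {"subject": "Michael Jackson", "relation": "was born on", "value": "August 29, 1958"},
    ]
    WRONG_DATES = ["July 5, 1958", "March 3, 1960", "November 12, 1949"]

    def make_pairs(facts):
        # Emit matched (statement, label) pairs from structured data.
        pairs = []
        for f in facts:
            pairs.append((f'{f["subject"]} {f["relation"]} {f["value"]}.', True))
            wrong = random.choice([d for d in WRONG_DATES if d != f["value"]])
            pairs.append((f'{f["subject"]} {f["relation"]} {wrong}.', False))
        return pairs

    for text, label in make_pairs(FACTS):
        print(label, text)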
But Michael Jackson was born on April 19th[0]! As well as March 27th[1], and many other days[2].
"Michael Jackson was born on August 29th" is the most likely answer to contextless queries like "When was Michael Jackson born?", but that does not make structurally identical sentences with different information false, merely less probable to be contextually correct.