
What do they mean by instruction? Is it just regular LLM?


LLM just predicts the next token given the previous tokens(this can be trained without manual labelling by humans).
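A toy illustration of that idea: even a bigram "model" built from raw text counts learns next-token prediction with no human labels, because the labels come from the text itself (the corpus here is made up, just to show the mechanism):

```python
# Toy next-token predictor: a bigram model built from raw text counts.
# Self-supervised -- the "label" for each token is simply the token
# that follows it in the corpus. (Hypothetical tiny corpus.)
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count which token follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    # Greedy decoding: pick the most frequent continuation seen in training.
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (seen twice after "the", vs "mat" once)
```

A real LLM replaces the count table with a transformer and greedy lookup with sampling, but the training signal is the same shape.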

Instruct GPT and ChatGPT use reinforcement learning from human feedback to align the model with human intents so it understands instructions.

https://huggingface.co/blog/rlhf


Note that Alpaca is NOT using RLHF. It explicitly states it used supervised finetuning.


It says

> We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003

Which leads to self-instruct https://github.com/yizhongw/self-instruct
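The self-instruct loop is roughly: seed tasks → prompt a teacher model for new instruction/output pairs → filter → fine-tune on the result. A hypothetical sketch of the data-generation step, with a stub standing in for the real text-davinci-003 API call:

```python
# Hypothetical sketch of self-instruct data generation.
# `teacher` is a stub; the real pipeline calls text-davinci-003 here.
def teacher(prompt):
    return "Instruction: Summarize the text.\nOutput: A short summary."

def generate_demonstrations(seed_tasks, n=3):
    demos = []
    for _ in range(n):
        prompt = "Come up with a new task like these:\n" + "\n".join(seed_tasks)
        completion = teacher(prompt)
        instruction, _, output = completion.partition("\nOutput: ")
        instruction = instruction.removeprefix("Instruction: ")
        # The real pipeline also filters near-duplicate instructions
        # before keeping a demonstration.
        demos.append({"instruction": instruction, "output": output})
    return demos

demos = generate_demonstrations(["Instruction: Translate to French."])
print(len(demos), demos[0]["instruction"])
```

Alpaca then fine-tunes LLaMA on 52K pairs produced this way.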

At a glance, they used an LM to classify instructions and train the model, which IMHO is very similar to RLHF.


No, it is not RLHF because there is no reward model involved. See also OpenAI's explanation here: https://platform.openai.com/docs/model-index-for-researchers
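The difference shows up in the objective. Supervised fine-tuning (Alpaca) minimizes plain negative log-likelihood on demonstration tokens; RLHF (InstructGPT) optimizes against a learned reward model with a KL penalty toward the base model. A hypothetical sketch of the two objectives, not any real training code:

```python
# Hypothetical sketch of the two objectives (toy numbers, not real training).
import math

def sft_loss(model_probs_of_demo_tokens):
    """Supervised fine-tuning: negative log-likelihood of the teacher's
    demonstration tokens -- what Alpaca uses."""
    return -sum(math.log(p) for p in model_probs_of_demo_tokens)

def rlhf_objective(reward_model_score, kl_to_base, beta=0.1):
    """RLHF (e.g. PPO in InstructGPT): maximize a *learned* reward model's
    score, minus a KL penalty keeping the policy near the base model.
    No such reward model exists in Alpaca's pipeline."""
    return reward_model_score - beta * kl_to_base

print(round(sft_loss([0.5, 0.25]), 4))  # -ln(0.5) - ln(0.25) = 2.0794
print(rlhf_objective(1.0, 2.0))         # 1.0 - 0.1 * 2.0 = 0.8
```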


Thanks. So what does the output look like without RLHF?


It can look like anything. Sometimes it will answer your questions, other times it will continue the question like it's the one asking. I've also seen it randomly output footers and copyright notices like it just got to the end of a webpage.

It makes sense when you think about how the training data is random text from the internet. Sometimes the most likely next token is the end of a webpage after an unanswered question.


This comment has a useful comparison between the two: https://news.ycombinator.com/item?id=35140447



