
What do they mean by instruction? Is it just regular LLM?


LLM just predicts the next token given the previous tokens(this can be trained without manual labelling by humans).
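A toy illustration of that idea: even a bigram "model" built from raw text counts learns next-token prediction with no human labels, because the labels come from the text itself (the corpus here is made up, just to show the mechanism):

```python
# Toy next-token predictor: a bigram model built from raw text counts.
# Self-supervised -- the "label" for each token is simply the token
# that follows it in the corpus. (Hypothetical tiny corpus.)
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count which token follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    # Greedy decoding: pick the most frequent continuation seen in training.
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (seen twice after "the", vs "mat" once)
```

A real LLM replaces the count table with a transformer and greedy lookup with sampling, but the training signal is the same shape.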

Instruct GPT and ChatGPT use reinforcement learning from human feedback to align the model with human intents so it understands instructions.

https://huggingface.co/blog/rlhf


Note that Alpaca is NOT using RLHF. It explicitly states it used supervised finetuning.


It says

> We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003

Which leads to self-instruct https://github.com/yizhongw/self-instruct
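The self-instruct loop is roughly: seed tasks → prompt a teacher model for new instruction/output pairs → filter → fine-tune on the result. A hypothetical sketch of the data-generation step, with a stub standing in for the real text-davinci-003 API call:

```python
# Hypothetical sketch of self-instruct data generation.
# `teacher` is a stub; the real pipeline calls text-davinci-003 here.
def teacher(prompt):
    return "Instruction: Summarize the text.\nOutput: A short summary."

def generate_demonstrations(seed_tasks, n=3):
    demos = []
    for _ in range(n):
        prompt = "Come up with a new task like these:\n" + "\n".join(seed_tasks)
        completion = teacher(prompt)
        instruction, _, output = completion.partition("\nOutput: ")
        instruction = instruction.removeprefix("Instruction: ")
        # The real pipeline also filters near-duplicate instructions
        # before keeping a demonstration.
        demos.append({"instruction": instruction, "output": output})
    return demos

demos = generate_demonstrations(["Instruction: Translate to French."])
print(len(demos), demos[0]["instruction"])
```

Alpaca then fine-tunes LLaMA on 52K pairs produced this way.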

At a glance, they used an LM to classify instructions and train the model, which IMHO is very similar to RLHF.


No, it is not RLHF because there is no reward model involved. See also OpenAI's explanation here: https://platform.openai.com/docs/model-index-for-researchers
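The difference shows up in the objective. Supervised fine-tuning (Alpaca) minimizes plain negative log-likelihood on demonstration tokens; RLHF (InstructGPT) optimizes against a learned reward model with a KL penalty toward the base model. A hypothetical sketch of the two objectives, not any real training code:

```python
# Hypothetical sketch of the two objectives (toy numbers, not real training).
import math

def sft_loss(model_probs_of_demo_tokens):
    """Supervised fine-tuning: negative log-likelihood of the teacher's
    demonstration tokens -- what Alpaca uses."""
    return -sum(math.log(p) for p in model_probs_of_demo_tokens)

def rlhf_objective(reward_model_score, kl_to_base, beta=0.1):
    """RLHF (e.g. PPO in InstructGPT): maximize a *learned* reward model's
    score, minus a KL penalty keeping the policy near the base model.
    No such reward model exists in Alpaca's pipeline."""
    return reward_model_score - beta * kl_to_base

print(round(sft_loss([0.5, 0.25]), 4))  # -ln(0.5) - ln(0.25) = 2.0794
print(rlhf_objective(1.0, 2.0))         # 1.0 - 0.1 * 2.0 = 0.8
```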


Thanks. So what does the output look like without RLHF?


It can look like anything. Sometimes it will answer your questions, other times it will continue the question like it's the one asking. I've also seen it randomly output footers and copyright notices like it just got to the end of a webpage.

It makes sense when you think about how the training data is random text from the internet. Sometimes the most likely next token is the end of a webpage after an unanswered question.


This comment has a useful comparison between the two: https://news.ycombinator.com/item?id=35140447



