
Is it a common practice in LLMs to give different weights to different training data sources?

For instance, I might want to say that all training data that comes from my in-house emails takes precedence over anything that comes from the internet?



Yes, it is. IIRC, back when OpenAI was open and published the breakdown of their training data, they were significantly overweighting Wikipedia.


Yes, though it's not about taking precedence, it's about sampling frequency. For example, if you have 1 GB of emails and 10 GB of external data, you can sample your emails twice as often and effectively change the ratio the model is trained on from 1:10 to 2:10.
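A minimal sketch of what that looks like in Python (source names, sizes, and weights here are made up for illustration; real training pipelines do this over streaming shards, not in-memory lists):

    import random

    # Hypothetical corpora. Each source gets a weight that scales how
    # often it is sampled relative to its raw size.
    sources = {
        "emails":   {"docs": ["email_1", "email_2"], "size_gb": 1,  "weight": 2.0},
        "internet": {"docs": ["web_1", "web_2"],     "size_gb": 10, "weight": 1.0},
    }

    # Sampling probability is proportional to size * weight, so 1 GB of
    # emails at weight 2 competes like 2 GB of unweighted data: the
    # effective mix becomes emails:internet = 2:10.
    totals = {name: s["size_gb"] * s["weight"] for name, s in sources.items()}
    norm = sum(totals.values())
    probs = {name: t / norm for name, t in totals.items()}

    def sample_document():
        # Pick a source by its weighted share, then a document within it.
        name = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
        return random.choice(sources[name]["docs"])

The trade-off is that upweighting a small source means repeating its documents across more epochs, which can cause memorization if pushed too far.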



