Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection Case Study (cleanlab.ai)
66 points by cmauck10 on June 29, 2023 | 9 comments


It's pretty common now for data scientists and ML engineers to validate the quality of the training data fed into these LLMs, but what about the test data used to evaluate them?

I spent some time playing around with FLAN-T5, the open-source LLM from Google Research, and discovered that noisy test/evaluation data can actually cause you to choose sub-optimal prompts.

Given two prompts A and B, I found multiple cases where prompt A performed better on the observed (noisy) test data, yet worse on the high-quality test data. In practice, this means you would choose A as the "best" prompt when B is actually the better one. I also verified via McNemar's test that the accuracy difference is statistically significant.
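
For reference, here's a minimal sketch of that paired comparison using statsmodels; the correct_a / correct_b arrays are hypothetical per-example correctness indicators for the two prompts on the same test examples:

    # Paired comparison of two prompts via McNemar's test.
    # correct_a / correct_b mark whether each prompt's answer matched
    # the label on the same test examples (illustrative values).
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    correct_a = np.array([True, True, False, True, False, True])
    correct_b = np.array([True, False, False, True, True, True])

    # 2x2 table of agreement/disagreement between the paired predictions
    table = [
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ]

    # Exact binomial test on the discordant pairs (off-diagonal cells)
    print(mcnemar(table, exact=True).pvalue)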

This article explains my methodology and how I used data-centric AI to automatically clean the noisy test data in order to ensure optimal prompt selection.
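
The cleaning step can be sketched with the open-source cleanlab library. This assumes you already have out-of-sample predicted class probabilities for each test example from some trained model; all values below are illustrative:

    # Flag likely label errors in a noisy test set with cleanlab.
    import numpy as np
    from cleanlab.filter import find_label_issues

    labels = np.array([0, 1, 1, 0, 2])  # observed (possibly noisy) test labels
    pred_probs = np.array([             # out-of-sample predicted probabilities
        [0.90, 0.05, 0.05],
        [0.20, 0.70, 0.10],
        [0.80, 0.10, 0.10],             # likely mislabeled: model favors class 0
        [0.85, 0.10, 0.05],
        [0.10, 0.20, 0.70],
    ])

    # Indices of examples to review or exclude before scoring prompts
    issue_indices = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(issue_indices)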


Given that LLMs fail to give consistent answers to the same questions, how does that factor into these studies?


Most LLMs allow you to specify the temperature parameter that governs the randomness and thus the creativity of the responses. For this experiment I used a very low temperature to ensure consistency for a given prompt.
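
For a locally hosted FLAN-T5, the deterministic limit of that is greedy decoding. A sketch via Hugging Face transformers (not necessarily the exact setup from the article):

    # Deterministic (greedy) generation with FLAN-T5.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    inputs = tokenizer("Classify the sentiment: I loved this movie.",
                       return_tensors="pt")
    # do_sample=False disables sampling entirely: the argmax token is
    # chosen at every step, so temperature no longer matters.
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))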


We run temp 0 for our prod app and see different results for the same prompt all the time. It's not every prompt, but it happens with the longer ones.


You can make an LLM entirely deterministic; that's how they work by design. Adding randomness to the response is a choice.


Maybe you can help me make this more deterministic? I am already setting temp to 0, but I keep getting different responses to the same prompt.

https://gist.github.com/derwiki/99079de4cfcb4f196b1ca561b3d7...


Looks to be an OpenAI problem due to how they host the API. People are speculating that different servers in their pool are not always on the exact same version of the model, so you can get pointed to a different version. I haven't seen proof, but quite a few people are complaining about it.

All the local models I've used were deterministic with temp 0.
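
A quick local repeatability check (sketch; model and prompt are arbitrary):

    # Greedy decoding on a pinned local model should be byte-identical
    # across runs.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
    inputs = tokenizer("What is the capital of France?", return_tensors="pt")

    runs = [
        tokenizer.decode(
            model.generate(**inputs, do_sample=False, max_new_tokens=10)[0],
            skip_special_tokens=True,
        )
        for _ in range(3)
    ]
    assert len(set(runs)) == 1, "non-deterministic output"
    print(runs[0])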


They can be deterministic, but that doesn’t mean the information they are regurgitating is correct.


That isn't what I was replying to, though. Factuality is different from determinism. They are not trained to retain facts; they are trained to predict the next word.





