It's now common for data scientists and ML engineers to validate the quality of the training data fed into LLMs, but what about the test data used to evaluate them?
I spent some time experimenting with Google Research's open-source FLAN-T5 LLM and discovered that noisy test/evaluation data can cause you to choose sub-optimal prompts.
Given two prompts A and B, I found multiple cases where prompt A performed better on the observed (noisy) test data yet worse on the high-quality test data. In practice, this means you would choose prompt A as the "best" prompt when prompt B is actually the better one. I also verified via McNemar's test that the accuracy difference is statistically significant.
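McNemar's test compares two classifiers (here, two prompts) on the same examples, looking only at the discordant pairs where exactly one of the two was correct. Below is a minimal sketch of the exact two-sided version, implemented from scratch with the standard library; the discordant counts are made up for illustration, not the article's actual results.

```python
# Exact McNemar's test for paired prompt accuracies (sketch).
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b = examples prompt A got right and prompt B got wrong;
    c = examples prompt B got right and prompt A got wrong.
    Under H0 the discordant pairs split 50/50, so the smaller
    count follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # Double the tail probability of the smaller count; clamp to 1.0
    # since doubling can overshoot when b == c.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant counts from comparing two prompts on one test set:
print(round(mcnemar_exact(5, 20), 4))  # prints 0.0041 -> significant
```

Only the discordant counts matter; examples where both prompts were right (or both wrong) carry no information about which prompt is better.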
This article explains my methodology and how I used data-centric AI to automatically clean the noisy test data in order to ensure optimal prompt selection.
Most LLMs let you specify a temperature parameter that governs the randomness, and thus the creativity, of the responses. For this experiment I used a very low temperature to ensure consistent responses for a given prompt.
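Concretely, temperature divides the model's logits before the softmax: low values sharpen the next-token distribution toward the single most likely token, while high values flatten it toward uniform. Here is a self-contained sketch with made-up logits (real logits would come from the model's final layer):

```python
# How temperature reshapes a next-token distribution (illustrative logits).
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax.
    T -> 0 approaches greedy (argmax) decoding; large T flattens
    the distribution toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # hypothetical token scores
low = softmax_with_temperature(logits, 0.1)    # near-deterministic
high = softmax_with_temperature(logits, 10.0)  # near-uniform
print([round(p, 3) for p in low])   # prints [1.0, 0.0, 0.0]
print([round(p, 3) for p in high])  # prints [0.362, 0.327, 0.311]
```

This is why a very low temperature gives near-identical responses for the same prompt: almost all of the probability mass concentrates on one token at each step.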
Looks to be an OpenAI problem due to how they host the API. People speculate that different servers in their pool are not always on the exact same version of the model, and you can get routed to a different version... no proof that I saw, but quite a few people are complaining about it.
All the local models I've used were deterministic with temp 0.
That isn't what I was replying to, though. Factuality is different from determinism. These models are not trained to retain facts; they are trained to predict the next word.