It's now common for data scientists and ML engineers to validate the quality of the training data fed into LLMs, but what about the test data used to evaluate them?
I spent some time experimenting with Google Research's open-source FLAN-T5 LLM and discovered that noisy test/evaluation data can cause you to choose sub-optimal prompts.
Given two prompts A and B, I found multiple cases where prompt A performed better on the observed (noisy) test data yet worse on the high-quality test data. In practice, this means you would choose prompt A as the "best" prompt when prompt B is actually the better one. I also verified via McNemar's test that the accuracy difference is statistically significant.
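McNemar's test compares two classifiers (here, two prompts) on the same examples, looking only at the discordant pairs where exactly one of the two was correct. Below is a minimal sketch of the exact two-sided version, implemented from scratch with the standard library; the discordant counts are made up for illustration, not the article's actual results.

```python
# Exact McNemar's test for paired prompt accuracies (sketch).
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b = examples prompt A got right and prompt B got wrong;
    c = examples prompt B got right and prompt A got wrong.
    Under H0 the discordant pairs split 50/50, so the smaller
    count follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # Double the tail probability of the smaller count; clamp to 1.0
    # since doubling can overshoot when b == c.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant counts from comparing two prompts on one test set:
print(round(mcnemar_exact(5, 20), 4))  # prints 0.0041 -> significant
```

Only the discordant counts matter; examples where both prompts were right (or both wrong) carry no information about which prompt is better.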
This article explains my methodology and how I used data-centric AI to automatically clean the noisy test data in order to ensure optimal prompt selection.
Most LLMs let you specify a temperature parameter that governs the randomness, and thus the creativity, of the responses. For this experiment I used a very low temperature to ensure consistent responses for a given prompt.
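Concretely, temperature divides the model's logits before the softmax: low values sharpen the next-token distribution toward the single most likely token, while high values flatten it toward uniform. Here is a self-contained sketch with made-up logits (real logits would come from the model's final layer):

```python
# How temperature reshapes a next-token distribution (illustrative logits).
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax.
    T -> 0 approaches greedy (argmax) decoding; large T flattens
    the distribution toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # hypothetical token scores
low = softmax_with_temperature(logits, 0.1)    # near-deterministic
high = softmax_with_temperature(logits, 10.0)  # near-uniform
print([round(p, 3) for p in low])   # prints [1.0, 0.0, 0.0]
print([round(p, 3) for p in high])  # prints [0.362, 0.327, 0.311]
```

This is why a very low temperature gives near-identical responses for the same prompt: almost all of the probability mass concentrates on one token at each step.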
Looks to be an OpenAI problem due to how they host the API. People speculate that different servers in their pool are not always on the exact same version of the model, and you can get routed to a different version... no proof that I saw, but quite a few people are complaining about it.
All the local models I've used were deterministic with temp 0.
That isn't what I was replying to, though. Factuality is different from determinism. These models are not trained to retain facts; they are trained to predict the next word.