Does anyone know of any good test suites we can use to benchmark these local models? It would be really interesting to compare all the ones capable of running on consumer hardware so that users can easily choose the best ones to use. Currently, I'm a bit unsure how this compares to the Alpaca model released a few weeks ago.
The measure of a "good" model is still very subjective. OpenAI has used stuff like standardized test scores to compare the latest iterations of GPT, but that is only one of the many possible objective measures and might not be relevant in a lot of cases. Maybe we'll come to a consensus around such a methodology soon, or maybe it'll be something every user has to judge on their own depending on their goals.
The simplest and quickest benchmark is to run a rap battle between GPT-4 and the local models. Copy-paste the responses between them so the models battle each other directly.
It is instantly clear how strong the model is relative to GPT-4.
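If you want to automate the relay instead of copy-pasting by hand, here's a minimal sketch. It assumes the pre-1.0 openai Python package for the GPT-4 side; local_generate() is a hypothetical placeholder for however you invoke your local model (llama.cpp, etc.):

    import openai  # pip install openai (pre-1.0 API); set openai.api_key first

    def gpt4_reply(prompt):
        # Send the opponent's latest verse to GPT-4 and return its verse.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    def local_generate(prompt):
        # Hypothetical wrapper around your local model's generation call;
        # swap in whatever your runtime actually exposes.
        raise NotImplementedError

    def rap_battle(rounds=4):
        verse = "You start. Drop the first verse of our rap battle."
        transcript = []
        for _ in range(rounds):
            verse = gpt4_reply(verse)
            transcript.append(("gpt-4", verse))
            verse = local_generate(verse)
            transcript.append(("local", verse))
        return transcript

Then just read the transcript; the gap in wordplay and coherence tends to be obvious.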
Test suites are not reflection complete! (See https://sdrinf.com/reflection-completeness.) Essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.
Instead, I strongly recommend putting together your own list of "control questions" that covers the general and specific use cases you're interested in. In particular, add questions on topics where you have a high degree of expertise, and topics where you can at least tell what an "expert" answer looks like; then run them against the available models yourself.
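As a sketch of what that might look like, here's a bare-bones harness; the questions are examples, and the callables in the models dict are placeholders for however you actually invoke each model:

    # Hypothetical harness: 'models' maps a label to any callable
    # that takes a prompt string and returns that model's answer.
    control_questions = [
        "Explain the difference between a mutex and a semaphore.",
        "What are the main side effects of ACE inhibitors?",
        # ...topics you can personally judge at an expert level
    ]

    def run_control_set(models, questions):
        results = {}
        for name, ask in models.items():
            results[name] = [(q, ask(q)) for q in questions]
        return results

    # Read the answers side by side and grade them yourself; the point
    # is your expert judgment, not an automated score.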
This is true of all the existing NLP benchmarks, but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were trying to optimize directly for them. (ImageNet shows its age now, but that's because it's become too easy.)
I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions; otherwise it will end up in the training data and everyone will overfit to it.
For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
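As a rough sketch of the idea, here's one way to stamp out who-wrote-what questions from the public Wikidata SPARQL endpoint; the query and the question template are just one of countless variations you could generate:

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # P31 = "instance of", Q7725634 = "literary work", P50 = "author".
    QUERY = """
    SELECT ?authorLabel ?bookLabel WHERE {
      ?book wdt:P31 wd:Q7725634;
            wdt:P50 ?author.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    def trivia_questions():
        resp = requests.get(SPARQL_ENDPOINT,
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "trivia-bench/0.1"})
        for row in resp.json()["results"]["bindings"]:
            book = row["bookLabel"]["value"]
            author = row["authorLabel"]["value"]
            yield f"Who wrote '{book}'?", author

    for question, answer in trivia_questions():
        print(question, "->", answer)

Multiply that by every property and entity type in the knowledge graph and you get a question pool far too large to memorize.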