Does anyone know of any good test suites we can use to benchmark these local models? It would be really interesting to compare all the ones capable of running on consumer hardware so that users can easily choose the best ones to use. Currently, I'm a bit unsure how this compares to the Alpaca model released a few weeks ago.
The measure of a "good" model is still very subjective. OpenAI has used stuff like standardized test scores to compare the latest iterations of GPT, but that is only one of the many possible objective measures and might not be relevant in a lot of cases. Maybe we'll come to a consensus around such a methodology soon, or maybe it'll be something every user has to judge on their own depending on their goals.
The simplest and quickest benchmark is to run a rap battle between GPT-4 and the local models. Copy-paste the responses between them so the models battle each other directly.
It is instantly clear how strong the model is relative to GPT-4.
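If you want to automate the relay instead of copy-pasting by hand, here's a minimal sketch. It assumes the pre-1.0 openai Python package for the GPT-4 side; local_generate() is a hypothetical placeholder for however you invoke your local model (llama.cpp, etc.):

    import openai  # pip install openai (pre-1.0 API); set openai.api_key first

    def gpt4_reply(prompt):
        # Send the opponent's latest verse to GPT-4 and return its verse.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    def local_generate(prompt):
        # Hypothetical wrapper around your local model's generation call;
        # swap in whatever your runtime actually exposes.
        raise NotImplementedError

    def rap_battle(rounds=4):
        verse = "You start. Drop the first verse of our rap battle."
        transcript = []
        for _ in range(rounds):
            verse = gpt4_reply(verse)
            transcript.append(("gpt-4", verse))
            verse = local_generate(verse)
            transcript.append(("local", verse))
        return transcript

Then just read the transcript; the gap in wordplay and coherence tends to be obvious.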
Test suites are not reflection complete! (See https://sdrinf.com/reflection-completeness.) Essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.
Instead, I strongly recommend putting together your own list of "control questions" that covers the general and specific use cases you're interested in. In particular, add questions on topics where you have a high degree of expertise, and topics where you can at least tell what an "expert" answer looks like; then run them against the available models yourself.
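As a sketch of what that might look like, here's a bare-bones harness; the questions are examples, and the callables in the models dict are placeholders for however you actually invoke each model:

    # Hypothetical harness: 'models' maps a label to any callable
    # that takes a prompt string and returns that model's answer.
    control_questions = [
        "Explain the difference between a mutex and a semaphore.",
        "What are the main side effects of ACE inhibitors?",
        # ...topics you can personally judge at an expert level
    ]

    def run_control_set(models, questions):
        results = {}
        for name, ask in models.items():
            results[name] = [(q, ask(q)) for q in questions]
        return results

    # Read the answers side by side and grade them yourself; the point
    # is your expert judgment, not an automated score.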
This is true of all the existing NLP benchmarks, but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were trying to optimize directly for them. (ImageNet shows its age now, but that's because it's become too easy.)
I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions; otherwise it will end up in the training data and everyone will overfit to it.
For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
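As a rough sketch of the idea, here's one way to stamp out who-wrote-what questions from the public Wikidata SPARQL endpoint; the query and the question template are just one of countless variations you could generate:

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # P31 = "instance of", Q7725634 = "literary work", P50 = "author".
    QUERY = """
    SELECT ?authorLabel ?bookLabel WHERE {
      ?book wdt:P31 wd:Q7725634;
            wdt:P50 ?author.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    def trivia_questions():
        resp = requests.get(SPARQL_ENDPOINT,
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "trivia-bench/0.1"})
        for row in resp.json()["results"]["bindings"]:
            book = row["bookLabel"]["value"]
            author = row["authorLabel"]["value"]
            yield f"Who wrote '{book}'?", author

    for question, answer in trivia_questions():
        print(question, "->", answer)

Multiply that by every property and entity type in the knowledge graph and you get a question pool far too large to memorize.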