OpenpPipe - https://openpipe.ai/ - is probably the service that most closely resembles what you’re asking for, but I found the evals weren’t really what I wanted — i.e. following my custom evaluation criteria — so you probably will end up having to do that yourself anyway. But for the finetuning, they’re all somewhat the same. Predibase and OpenPipe are two good options for that. Predibase has more base models for you to finetune, but it’s a bit more unwieldy to work with. I wrote about that in a previous post here -- https://mlops.systems/posts/2024-06-17-one-click-finetuning.....
(Disclaimer: founder of OpenPipe). Thanks for the shout-out. Note that we're actively working on improved evaluations that will let you add more specific criteria as well as more evaluation types, like comparing field values to that of a golden dataset. This is definitely something that customers are asking for!
Wild to see them advertising collecting GPT4 responses for training other models. That’s definitely not allowed by TOS. I suspect many do, but front page advertising is another thing entirely.
Predibase ( http://predibase.com ), also referred in the article, is a platform specifically designed for exactly that. It also has "repos" for finetuning multiple models and comapre their performance and keeping things organzie. It also allow you to query any of the finetuned models on the fly from a single GPU with multi-lora serving. (Predibase founder here)
When jumping from 7B parameters to 70B to 400B (or whatever GPT-4 uses) most of the additional neurons seem to go towards a better world model and better reasoning (or whatever you want to call the inference of new information from known information). There doesn't seem to be any major improvements in basic language skills past 7B, and even 1B and 3B models do pretty well on that front.
In that sense it's not that surprising that on a pure text extraction task with little "thinking" required a 7B model does well and outperforms other models after fine tuning. In the "noshotsfired" label GPT-4 is even accused of overthinking it.
It is interesting how finetuned mistral-7b and llama3-7b outperform finetuned gpt3.5-turbo. I would tend to attribute that to those models being newer and "more advanced" despite their low parameter count, but maybe that's interpreting too much into a small score difference.
That’s still the point. That model now does exactly one thing, and because of that can do better than a model 50x the size that tries to do everything. It will crush it in instruction following and consistency.
A fine tuned 500b parameter model would probably beat the fine tuned 7b model, but only by a bit (depending on task obviously). A lot of that capacity is being used for knowledge, and isn’t needed for extraction/classification tasks. Fine tuning isn’t touching most of those weights. The smaller models need to focus on more general language skills, not answering “describe the evolution of France’s economy in the 1800s”.
Still good to see someone walk through their fine tuning process, with a mix of hosted and local options.