They state in their report that they filter evaluation data out of their training data; see p. 3, "Filtering":
"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive
outputs."
According to their paper, Mistral's average on standard tasks is 54.0 and Gemma's is 56.4, i.e. (56.4 - 54.0) / 54.0 ≈ 4.4% better in relative terms. Not as big a gap as you would expect from the company that invented transformers and probably has 2-3 orders of magnitude more compute for training, versus a few-month-old French startup.
Also of note, in their human evaluations Gemma 7B IT has a 51.7% win rate against Mistral 7B Instruct v0.2.
A caveat: my impression of Phi-2, based on my own use and others' experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger, unable to perform almost any real-world task, because it has been fed almost exclusively synthetic data targeted at improving benchmark performance.
Funny, that's not my experience with Phi-2. I use it in a non-creative context, namely function calling, and I find it as reliable as much bigger models (no fine-tuning, just constrained JSON output + CoT). Comparing Phi-2 unquantized vs. Mixtral Q8, Mixtral is not definitively better, but it is much slower and more RAM-hungry.
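To be concrete, "constraining JSON" here means a llama.cpp GBNF grammar; the model literally can't emit tokens outside it. A rough sketch of the shape I mean (the function names and schema are invented for illustration, not my actual grammar):

    # fcall.gbnf: force one CoT line, then a function-call JSON object
    root   ::= "Reasoning: " [^\n]+ "\n" call
    call   ::= "{\"function\": " fname ", \"arguments\": " args "}"
    fname  ::= "\"get_weather\"" | "\"search_docs\""
    args   ::= "{\"query\": " string "}"
    string ::= "\"" [^"\n]* "\""

Run it with something like ./main -m phi-2.Q8_0.gguf --grammar-file fcall.gbnf -p "...", and the output stays machine-parseable even from a small model.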
What prompts/settings do you use for Phi-2? I found it completely unusable for my use cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it has mostly been a random-garbage generator for me. With llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations that satisfy the JSON grammar.
I'm building a system with many different passes (~15 so far). Almost every pass is an LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe what each pass does, then ask Phi-2 to list the passes relevant to the user query (I call it "pass masking"; rough sketch below). That would save a lot of time and collapse 15 steps to 2-3 steps on average. In fact, Solar 10.7B does it pretty well, but the masking pass takes 7 seconds on my GPU; Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: unlike Solar, it fails to reason about what's relevant and what's not, and it also refuses to follow the output format (which I need so I can parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar and fails spectacularly with Phi-2.
> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.
and then I constrain its answers to Thoughts: [^\n]* and Assistant: <JSON schema>, and I include two few-shot examples in the prompt.
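For the curious, the overall flow is roughly this (a sketch using llama-cpp-python, with made-up pass names and model paths, not my actual code):

    import json
    
    from llama_cpp import Llama, LlamaGrammar
    
    # Hypothetical pass registry -- the real system has ~15 entries.
    PASSES = {
        "rewrite_query": "clean up and expand the user query",
        "web_search": "fetch external information",
        "summarize": "condense retrieved context",
    }
    
    # Grammar enforcing: Thoughts: <one line>\nAssistant: ["pass", ...]
    GRAMMAR = LlamaGrammar.from_string(r'''
    root ::= "Thoughts: " [^\n]+ "\nAssistant: [" (name ("," name)*)? "]"
    name ::= "\"" [a-z_]+ "\""
    ''')
    
    llm = Llama(model_path="solar-10.7b-instruct.Q5_K_M.gguf", n_ctx=2048)
    
    def mask_passes(query: str) -> list[str]:
        """Ask the gateway model which passes matter for this query."""
        listing = "\n".join(f"- {k}: {v}" for k, v in PASSES.items())
        prompt = (
            f"System: Given these passes:\n{listing}\n"
            # ...the two few-shot examples go here...
            f"User: {query}\n"
        )
        out = llm(prompt, grammar=GRAMMAR, max_tokens=128, temperature=0.0)
        text = out["choices"][0]["text"]
        # The grammar guarantees the part after "Assistant: " is a JSON list.
        return json.loads(text.split("Assistant: ", 1)[1])
    
    # e.g. run only the selected passes:
    #   for name in mask_passes("What's new in llama.cpp?"): run_pass(name)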
I haven't been able to get anything useful out of Phi-2 in llama.cpp (but I only tried quantized models). I use Python and Hugging Face's transformers lib instead.
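FWIW, the unquantized route is only a few lines. A minimal sketch (the Instruct:/Output: prompt format is from the phi-2 model card; the generation settings are just what I'd try first):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2",
        torch_dtype=torch.float16,
        device_map="auto",  # needs `accelerate` installed
    )
    
    # "Instruct:/Output:" is the QA format from the phi-2 model card.
    prompt = "Instruct: Summarize why the sky is blue in one sentence.\nOutput:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Drop the prompt tokens, decode only the completion.
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))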
An update on my endeavour: model switching is very costly under llama.cpp (I have to swap between Llama and Phi-2 because my GPU has little VRAM), and that switch (reloading the weights into VRAM) defeats the whole purpose of the optimization. Keeping only Llama on the GPU, with no reloading, takes less time than using Llama + Phi-2. And Phi-2 alone is pretty bad as a general-purpose LLM. So I'm quite disappointed.
I recently upgraded to AM5, and since I have an AMD GPU I'm running llama.cpp CPU-only. I was positively surprised by how fast it generates stuff. I don't have massive workloads, though, so YMMV.
Hear hear! I don't understand why it has persistent mindshare; it's not even trained for chat. Meanwhile, StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel...
The idea is usage-based charging for non-local models and a $5/month sub for syncing.
Keep an eye on @jpohhhh on Twitter if you're interested.
Now that I've got it working on the web, I'm hoping to at least get a PoC up soon. I've open-sourced the constituent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings; FLLAMA has llama.
Miqu was (allegedly) an internal continued pretrain that Mistral did as a test, which was leaked as a GGUF.
Maybe it's just semantics; it is technically a finetune... But to me there's a big difference between expensive "continuation training" (like Solar 10.7B or Mistral 70B) and a much less intense finetune. The former is almost like releasing a whole new base model.
It would be awesome if Mistral did that with their data, but that's very different from releasing a Gemma Instruct finetune.
There's typically a difference in learning rate (LR) between a "continued pretrain" and a "finetune." I don't have the details on Miqu, but I was merely trying to say that Mistral could produce a better version of these models than the OSS community might. If the size of the corpora they use means we are no longer in finetuning territory, then okay.
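To put rough numbers on that distinction (made up, but order-of-magnitude typical; not Mistral's or anyone's actual recipe):

    continued_pretrain = {
        "learning_rate": 1e-4,      # near late-stage pretraining LR
        "tokens": 100_000_000_000,  # tens to hundreds of billions of raw tokens
        "data": "broad web/corpus mix",
    }
    finetune = {
        "learning_rate": 2e-5,      # roughly an order of magnitude lower
        "tokens": 50_000_000,       # millions of curated instruction tokens
        "data": "instruction/chat pairs",
    }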
Honestly, this is more of a PR stunt to advertise the Google dev ecosystem than a contribution to open source. I'm not complaining, just calling it what it is.
Barely an improvement over the 5-month-old Mistral model, with the same 8k context length. And this release comes after their announcement of Gemini Pro 1.5, with its orders-of-magnitude jump in context length.
It's more like a masterclass in corporate doublespeak. Google’s "transparency" is as clear as mud, with pretraining details thinner than their privacy protections. Diving into Google’s tech means auctioning off your privacy (and your users' privacy) to the highest bidder.
Their "open source" embrace is more of a chokehold, with their tech biases and monopolistic strategies baked into every line of code. Think of it as Google's way of marking territory - every developer is a fire hydrant.
These megacorps aren’t benevolent patrons of open source; they're self-serving giants cloaking power grabs under the guise of "progress".
Use these products at your own risk. If these companies wanted to engage in good faith, they'd use Apache or MIT licensing and grant people the agency and responsibility for their own use and development of software. Their licenses are designed to mitigate liability, handcuff potential competitors, and wring every last drop of value from users, with informed consent frequently being an optional afterthought.
That doesn't even get into the Goodharting of benchmark metrics versus the actual performance of the models; I highly doubt they're anywhere near as good as Mistral's.
If it's not Apache or MIT (or even some flavor of GPL), it's not open source; it's a Trojan horse. These "free" models come at the cost of your privacy and freedoms.
These models aren't Open or Open Access or Free unless you perform the requisite mental gymnastics cooked up by their marketing and legal teams. Oceania has always been at war with Eastasia. Gemma is doubleplusgood.
You said a lot of nothing without actually saying specifically what the problem is with the recent license.
Maybe the license is fine for almost all use cases, and the limitations are small?
For example, you complained about Meta's license, but basically everyone uses those models and completely ignores it. The weights are out there, and nobody cares what the fine print says.
Maybe if you are a FAANG company, Meta might sue. But everyone else is getting away with it completely.
I specifically called out the claims of openness and doublespeak being used.
Google is making claims that are untrue. Meta makes similar false claims. The fact that unspecified "other" people are ignoring the licenses isn't relevant. Good for them. Good luck making anything real, or investing any significant amount of time or money, under those misconceptions.
"They haven't sued yet" isn't some sort of validation. Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion, their IP taken, and repurposed or buried. These tech companies have never behaved otherwise, and to think that they will is willfully oblivious.
They don't deserve the benefit of the doubt, and should be called out for using deceitful language, making comparisons between their performative "openness" and actual, real, open source software. Mistral and other players have released actually open models and software. They're good faith actors, and if you're going to build a product requiring a custom model, the smart money is on Mistral.
FAANG are utilizing gotcha licenses and muddying the waters to their own benefit, not as a contribution to the public good. Building anything on the assumption that Meta or Google won't sue is beyond foolish. They're just as open as "Open"AI, which is to say not open at all.
> Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion
No they won't and they haven't.
Almost the entire startup scene is completely ignoring all these licenses right now.
This is basically the entire industry. We are all getting away with it.
Here's an example: take Llama.
Llama originally disallowed commercial activity, but the license was changed much later.
So if you were a stupid person, you followed the license and fell behind. And if you were smart, you ignored it and got ahead of everyone else.
Which, in retrospect, was the correct call: the license now allows commercial activity, so everyone who ignored it in the first place got away with it and is now ahead of everyone else.
> won't sue is beyond foolish
But we already got away with it with Llama! That's already over. It's commercial now, and nobody got sued! In that example, the people who ignored the license won.