Hacker News

Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B

  +-------------+----------+-------------+-------------+
  | Benchmark   | Gemma 7B | Mistral 7B  | Llama-2 7B  |
  +-------------+----------+-------------+-------------+
  | MMLU        |   64.3   |     60.1    |     45.3    |
  | HellaSwag   |   81.2   |     81.3    |     77.2    |
  | HumanEval   |   32.3   |     30.5    |     12.8    |
  +-------------+----------+-------------+-------------+
via https://mistral.ai/news/announcing-mistral-7b/


Thank you. I thought it was weird for them to release a 7B model and not mention Mistral in their release.


The technical report (linked in the 2nd paragraph of the blog post) mentions it, and compares against it: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...


The release page has comparisons to Mistral everywhere: https://ai.google.dev/gemma


Good to know, although cmd+f for "mistral" returns 0 hits on the original link


They forgot.

Also phi-2.


Only 8K context as well, like Mistral.

Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.


Mistral Instruct v0.2 is 32K.


Mixtral (8x7b) is 32k.

Mistral 7b instruct 0.2 is just a fine tune of Mistral 7b.


The original Mistral, or the GGUF one?


Agreed: it will be interesting to see how Gemma does on Chatbot Arena.


They state in their report that they filter evaluation data out of their training data; see p.3, "Filtering":

"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive outputs."


According to their paper, the average over standard tasks is 54.0 for Mistral and 56.4 for Gemma, i.e. about 4.4% better in relative terms. Not as big a gap as you would expect from the company that invented transformers and probably has 2-3 orders of magnitude more compute for training, versus a few-month-old French startup.
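For the record, the relative-gain arithmetic checks out (numbers taken from the paper averages quoted above):

```python
# Relative improvement of Gemma over Mistral on the paper's standard-task average.
mistral_avg = 54.0
gemma_avg = 56.4
relative_gain = (gemma_avg - mistral_avg) / mistral_avg
print(f"{relative_gain:.1%}")  # prints 4.4%
```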

Also of note: in their human evaluations, Gemma 7B IT has a 51.7% win rate against Mistral 7B Instruct v0.2.


Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...


A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.


Funny, that's not my experience with Phi-2. I use it not for creative contexts but for function calling, and I find it as reliable as much bigger models (no fine-tuning, just constrained JSON + CoT). Comparing Phi-2 unquantized vs. Mixtral Q8, Mixtral is not definitively better, but it is much slower and more RAM-hungry.


What prompts/settings do you use for Phi-2? I found it completely unusable for my cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it's been mostly like a random garbage generator for me. With Llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations which satisfy the JSON grammar.

I'm building a system which has many different passes (~15 so far). Almost every pass is an LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe what each pass does, and then ask Phi-2 to list the passes relevant to the user query (I called it "pass masking"). That would save a lot of time and collapse 15 steps to 2-3 steps on average. In fact, my Solar 10.7B model does this pretty well, but the masking pass takes 7 seconds on my GPU; Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: it fails to reason about what's relevant and what's not, unlike Solar, and it also refuses to follow the output format (which I need so I can parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar and fails spectacularly with Phi-2.
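For what it's worth, the gateway idea can be sketched like this (the pass names are made up, and the small-model call itself is left out; only the prompt building and the defensive parsing are shown):

```python
# Sketch of the "pass masking" gateway described above.
# Pass names and descriptions are hypothetical placeholders.
PASSES = {
    "summarize": "Condense the user query into a short summary.",
    "code_gen": "Generate code requested by the user.",
    "web_search": "Decide whether external facts are needed.",
}

def build_masking_prompt(query):
    lines = [f"- {name}: {desc}" for name, desc in PASSES.items()]
    return (
        "Available passes:\n" + "\n".join(lines) +
        f"\n\nUser query: {query}\n"
        "List only the pass names relevant to this query, comma-separated."
    )

def parse_mask(reply):
    # Keep only names we actually know, so garbage output degrades safely
    # to "run fewer passes" instead of crashing the pipeline.
    return {p.strip() for p in reply.split(",")} & set(PASSES)
```

The intersection in `parse_mask` is the important part: a small model that hallucinates an unknown pass name simply has that name dropped.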


My non-domain-specific prompt is:

> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.

and then I constrain its answers to Thoughts: [^\n]* and Assistant: <JSON schema>, and I have two shots included in the prompt.
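A sketch of what that output contract looks like as an after-the-fact validator (the JSON-schema half is stubbed out as a plain JSON parse; at generation time the shape would be enforced by the sampler's constraints, not checked afterwards):

```python
import json
import re

# Validates a reply of the form:  Thoughts: <one line>\nAssistant: <JSON>
PATTERN = re.compile(r"Thoughts: (?P<thoughts>[^\n]*)\nAssistant: (?P<payload>.+)", re.S)

def parse_reply(text):
    m = PATTERN.fullmatch(text)
    if not m:
        return None
    try:
        payload = json.loads(m.group("payload"))  # stand-in for a real schema check
    except json.JSONDecodeError:
        return None
    return m.group("thoughts"), payload
```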

I haven't been able to get anything useful out of Phi-2 in llama.cpp (but I only tried quantized models). I use python/huggingface's transformers lib instead.


Interesting. I've had no success at all using any of the Phi2 models.


An update on my endeavour: so, model switching is very costly under llama.cpp (I have to switch between Llama and Phi2 because my GPU has low amounts of VRAM). And this switch (reloading the weights into VRAM) defeats the whole purpose of the optimization. Having only Llama on GPU without reloading takes less time than if I'd use Llama+Phi2. And Phi2 alone is pretty bad as a general purpose LLM. So I'm quite disappointed.


I recently upgraded to AM5, and since I have an AMD GPU I'm using llama.cpp on CPU only. I was positively surprised by how fast it generates text. I don't have massive workloads, so YMMV.


Hear hear! I don't understand why it has persistent mindshare, it's not even trained for chat. Meanwhile StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel ..


How have you been using RAG in your browser/on your phones?


To be released, someday [sobs in engineer]

Idea is usage-based charging for non-local and a $5/month sub for syncing.

keep an eye on @jpohhhh on Twitter if you're interested

Now that I've got it working on web, I'm hoping to at least get a PoC up soon. I've open-sourced the constituent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings, FLLAMA has llama.

https://github.com/Telosnex/fonnx

https://github.com/Telosnex/fllama


I tested it for an offline autocompletion tool and it was hilariously bad.


Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.


the real gold will be when this gets finetuned. (maybe by mistral...)


TBH the community has largely outrun Mistral's own finetuning. The 7B model in particular is such a popular target because it's so practical to train.


Strong disagree - a Mistral fine tune of llama 70b was the top performing llama fine tune. They have lots of data the community simply does not.


Miqu was (allegedly) an internal continued pretrain Mistral did as a test, that was leaked as a GGUF.

Maybe it's just semantics; it is technically a finetune... But to me there's a big difference between expensive "continuation training" (like Solar 10.7B or Mistral 70B) and a much less intense finetuning. The former is almost like releasing a whole new base model.

It would be awesome if Mistral did that with their data, but that's very different than releasing a Gemma Instruct finetune.


There’s typically a difference in LR between a ‘continued pretrain’ and ‘fine tune.’ I don’t have the details around miqu, but was merely trying to say that Mistral could produce a better version of these models than the OSS community might. If the size of the corpora they use means we are no longer in fine tuning territory, then okay.


Arthur Mensch, the Mistral CEO, confirmed the leak. https://twitter.com/arthurmensch/status/1752737462663684344


Also, it led to one of the funniest PRs I've seen in a while:

https://huggingface.co/miqudev/miqu-1-70b/discussions/10


No shot. Mistral Medium's outputs from the API were virtually identical. Miqu really was Mistral Medium, which happened to be a continued pretrain.


how does one finetune llama (or any other LLM) using mistral?

is the flow like this?

- take small dataset

- generate bigger dataset using mistral (how is this done?)

- run LoRA to fine-tune gemma on the extended dataset.


I should have said "run LoRA or your favorite fine-tuning technique to produce your fine-tuned llama."
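A rough sketch of the "generate a bigger dataset" step, self-instruct style (the seed pairs are hypothetical, and the actual Mistral call is left out; only the prompt building and output parsing are shown):

```python
import json

# Show the stronger model (e.g. Mistral) a few seed instruction/response
# pairs and ask it to produce new, similar pairs as line-delimited JSON.
def build_expansion_prompt(seed_pairs, n_new=5):
    shown = "\n".join(
        f"Instruction: {p['instruction']}\nResponse: {p['response']}"
        for p in seed_pairs
    )
    return (
        "Here are example instruction/response pairs:\n\n"
        f"{shown}\n\n"
        f"Write {n_new} new pairs in the same style, as JSON objects with "
        '"instruction" and "response" keys, one per line.'
    )

def parse_new_pairs(raw):
    pairs = []
    for line in raw.splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that aren't valid JSON
        if isinstance(obj, dict) and {"instruction", "response"} <= obj.keys():
            pairs.append(obj)
    return pairs
```

The parsed pairs would then go into whatever fine-tuning setup you use for step 3 (LoRA or otherwise).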


https://www.youtube.com/watch?v=1Mn0U6HGLeg Some test vids came out on the 7B model. Shocker: it doesn't perform well at all.


In my subjective tests it's not even close to Mistral. My local Gemma is quantized, but so is Mistral.

But I also tried gemma on huggingface.co/chat which I assume isn't quantized.


Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.

Barely an improvement over the 5-month-old Mistral model, with the same 8k context length. And this comes after their announcement of Gemini Pro 1.5, which had a dramatic increase in context length.


Who cares if it's a PR stunt to improve developer good will? It's still a good thing, and it's now the most open model out there.


How is it more open than Mistral with Apache 2.0? Google wants people to sign a waiver to even download it.


Fair enough; that was more directed at LLaMA and derivatives, which have commercial restrictions.


How exactly is it the "most open model" ?

It's more like a masterclass in corporate doublespeak. Google’s "transparency" is as clear as mud, with pretraining details thinner than their privacy protections. Diving into Google’s tech means auctioning off your privacy (and your users' privacy) to the highest bidder.

Their "open source" embrace is more of a chokehold, with their tech biases and monopolistic strategies baked into every line of code. Think of it as Google's way of marking territory - every developer is a fire hydrant.

These megacorps aren’t benevolent patrons of open source; they're self-serving giants cloaking power grabs under the guise of "progress".

Use these products at your own risk. If these companies wanted to engage in good faith, they'd use Apache or MIT licensing and grant people the agency and responsibility for their own use and development of software. Their licenses are designed to mitigate liability, handcuff potential competitors, and eke every last drop of value from users, with informed consent frequently being an optional afterthought.

That doesn't even get into the Goodharting of metrics and actual performance of the models; I highly doubt they're anywhere near as good as Mistral.

The UAE is a notoriously illiberal authoritarian state, yet even they have released AI models far more free and open than Google or Meta. https://huggingface.co/tiiuae/falcon-40b/blob/main/README.md

If it’s not Apache or MIT, (or even some flavor of GPL,) it’s not open source; it’s a trojan horse. These "free" models come at the cost of your privacy and freedoms.

These models aren't Open or Open Access or Free unless you perform the requisite mental gymnastics cooked up by their marketing and legal teams. Oceania has always been at war with Eastasia. Gemma is doubleplusgood.


You said a lot of nothing without actually saying specifically what the problem is with the recent license.

Maybe the license is fine for almost all usecases and the limitations are small?

For example, you complained about Meta's license, but basically everyone uses those models and completely ignores it. The weights are out there, and nobody cares what the fine print says.

Maybe if you are a FAANG company, Meta might sue. But everyone else is getting away with it completely.


I specifically called out the claims of openness and doublespeak being used.

Google is making claims that are untrue. Meta makes similar false claims. The fact that unspecified "other" people are ignoring the licenses isn't relevant. Good for them. Good luck making anything real or investing any important level of time or money under those misconceptions.

"They haven't sued yet" isn't some sort of validation. Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion, their IP taken, and repurposed or buried. These tech companies have never behaved otherwise, and to think that they will is willfully oblivious.

They don't deserve the benefit of the doubt, and should be called out for using deceitful language, making comparisons between their performative "openness" and actual, real, open source software. Mistral and other players have released actually open models and software. They're good faith actors, and if you're going to build a product requiring a custom model, the smart money is on Mistral.

FAANG are utilizing gotcha licenses and muddying the waters to their own benefit, not as a contribution to the public good. Building anything on the assumption that Meta or Google won't sue is beyond foolish. They're just as open as "Open"AI, which is to say not open at all.


> Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion

No they won't and they haven't.

Almost the entire startup scene is completely ignoring all these licenses right now.

This is basically the entire industry. We are all getting away with it.

Here's an example, take llama.

Llama originally disallowed commercial activity. But then the license got changed much later.

So, if you were a stupid person, then you followed the license and fell behind. And if you were smart, you ignored it and got ahead of everyone else.

Which, in retrospect was correct.

Because now the license allows commercial activity, so everyone who ignored it in the first place got away with it and is now ahead of everyone else.

> won't sue is beyond foolish

But we already got away with it with llama! That's already over! It's commercial now, and nobody got sued! For that example, the people who ignored the license won.


The nice thing about this is that the calculus is in favor of startups, who can roll the dice.


That’s about the point of having a developer ecosystem, isn’t it?


mistral 7b v0.2 supports 32k


This is a good point actually, and an underappreciated fact.

I think so many people (including me) effectively ignored Mistral 0.1's sliding window that few realized 0.2 instruct is native 32K.


Mixtral 8x7B has 32k context.

Mistral 7B Instruct v0.2 is just an instruct fine tune of Mistral 7B and stays with an 8k context.



