
Looking at the config.json of Gemma 7B, the feedforward hidden size is 8x, not 16x


Huh, indeed, that's what the config.json[0] says; the report[1] indicates “Feedforward hidden dims: 49152”.

[0]: https://huggingface.co/google/gemma-7b-it/blob/main/config.j...

[1]: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...


I don't see the number 49152 reported in the config.json; what line are you referring to? I just see the intermediate_size of 24576 (so 8x).

EDIT: I didn't read the comment correctly; you noticed the same thing.


The *GLU-based activation functions like GEGLU and SwiGLU use 2 input values to produce 1 output value, which makes these numbers weird. In each value pair, one goes through the GELU/SiLU activation function and is then multiplied by the other "gate" value.
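
For concreteness, here's a minimal GEGLU FFN sketch in PyTorch (my own illustration, not Gemma's actual code; the dims are taken from the config):

    import torch
    import torch.nn.functional as F

    d_model, intermediate_size = 3072, 24576  # Gemma-7B config values

    # The up-projection produces 2 * intermediate_size values: in each
    # pair, one value goes through GELU and then scales the other.
    up_proj = torch.nn.Linear(d_model, 2 * intermediate_size, bias=False)
    down_proj = torch.nn.Linear(intermediate_size, d_model, bias=False)

    def geglu_ffn(x):
        gate, value = up_proj(x).chunk(2, dim=-1)  # split into pairs
        return down_proj(F.gelu(gate) * value)     # 2 inputs -> 1 output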

In the report, "hidden dim" matches the number of GEGLU inputs. In the config, "intermediate_size" matches the number of GEGLU outputs. Most *GLU models so far have used intermediate_size = 8/3 * d_model, as this gives them the same number of matmul FLOPs & parameters as a 4x-expanded non-GLU model, and PaLM vaguely showed that 4x is better than a smaller expansion factor.
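
A quick way to check that parameter parity (my arithmetic, done exactly with Fraction to avoid float noise; matmul FLOPs scale the same way):

    from fractions import Fraction

    d = Fraction(4096)  # any d_model works; only the ratio matters

    # Classic 4x non-GLU FFN: two matrices, d -> 4d -> d
    non_glu = d * (4 * d) + (4 * d) * d         # 8 * d^2

    # *GLU FFN with intermediate_size = 8/3 * d: three matrices
    # (gate d -> 8d/3, up d -> 8d/3, down 8d/3 -> d)
    glu = 3 * (d * Fraction(8, 3) * d)          # also 8 * d^2

    print(non_glu == glu)  # True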

If one considers Llama-2-7B's FFN expansion factor to be ~5.33x, Gemma's expansion factor is 16x.
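
Plugging in the published numbers (hidden sizes from the respective config.json files; the ~5.33x is the nominal 2 * 8/3 before rounding):

    d_gemma, inter_gemma = 3072, 24576  # Gemma-7B
    print(inter_gemma / d_gemma)        # 8.0  -> the config's "8x"
    print(2 * inter_gemma / d_gemma)    # 16.0 -> the report's 49152 counts GEGLU inputs

    d_llama, inter_llama = 4096, 11008  # Llama-2-7B
    print(2 * inter_llama / d_llama)    # 5.375 (~5.33x nominal, rounded up for alignment)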


Makes perfect sense, thx


Read the parent comment again. It says the paper says 49152, not the config.json.




