

Grosvenor on Oct 8, 2024 | on: Differential Transformer

I guess the next step is to see whether you're getting those mega activations as he describes. A/B test the two models and compare? It would be interesting to see whether these activations only show up in larger models, or whether they bear some other relation to model size.

Grosvenor on Oct 9, 2024

https://news.ycombinator.com/item?id=36871528

Hah. Yes. It looks like they only show up in models with 6.7B parameters or more.

The problem can start at 125M parameters, which is small enough to test on a whim.

So train a model that exhibits these behaviours, then try it out.
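
For what it's worth, the A/B comparison could be sketched roughly like this. It is only a minimal illustration, not anything from the paper or the linked thread: the function names, the dict-of-lists input format, the max-over-median metric, and the threshold of 100 are all assumptions made up for the example. In practice the activations would have to be captured from each model first, for instance with forward hooks.

```python
from statistics import median


def outlier_ratio(activations):
    """Ratio of the largest absolute activation to the median absolute one.

    A large ratio suggests a few values dominate the layer (hypothetical metric).
    """
    magnitudes = [abs(a) for a in activations]
    if not magnitudes:
        return 0.0
    med = median(magnitudes)
    if med == 0:
        # Avoid dividing by zero when at least half the values are exactly 0.
        return float("inf") if max(magnitudes) > 0 else 0.0
    return max(magnitudes) / med


def compare_models(acts_a, acts_b, threshold=100.0):
    """Compare per-layer outlier ratios of two models.

    acts_a, acts_b: dicts mapping a layer name to a flat list of activation
    values (assumed format). `threshold` is an arbitrary cutoff for flagging.
    Only layers present in both dicts are compared.
    """
    report = {}
    for layer in sorted(set(acts_a) & set(acts_b)):
        ratio_a = outlier_ratio(acts_a[layer])
        ratio_b = outlier_ratio(acts_b[layer])
        report[layer] = {
            "a": ratio_a,
            "b": ratio_b,
            "a_flagged": ratio_a > threshold,
            "b_flagged": ratio_b > threshold,
        }
    return report


# Toy numbers, invented purely to exercise the functions.
baseline = {
    "layer0": [0.5, -1.0, 1.5, -0.5, 1.0],
    "layer1": [1.0, -1.0, 1.0, 2000.0, -1.0],
}
candidate = {
    "layer0": [0.5, -1.0, 1.5, -0.5, 1.0],
    "layer1": [1.0, -1.0, 1.0, 3.0, -1.0],
}
report = compare_models(baseline, candidate)
for layer, stats in report.items():
    print(layer, stats)
```

Running the same check on checkpoints of different sizes would be one way to look at the model-size question, assuming the same layers and the same inputs are compared each time.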
