Hacker News

I understand this is ELI5, but doesn’t attention already do this, in the way you described? It pays specific focus to the most contextual words in the prior sequence.


Not from a computational perspective. To calculate the attention scores you have to score every token against every other token, which is quadratic. Every article like "a", "an", or "the" still has to be scored against every other word, even though articles are only relevant within a short distance of the word they are attached to.
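For concreteness, here's a minimal NumPy sketch (not code from any particular model) showing where the quadratic cost comes from: the score matrix has one entry per (query, key) pair, so for n tokens it is n x n.

```python
import numpy as np

def attention_weights(q, k):
    # q, k: (n_tokens, d) arrays. The score matrix q @ k.T is
    # (n_tokens, n_tokens) -- every token scored against every other,
    # which is why the cost grows quadratically with sequence length.
    scores = q @ k.T / np.sqrt(q.shape[1])
    # row-wise softmax to turn scores into attention weights
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

n, d = 8, 4
rng = np.random.default_rng(0)
w = attention_weights(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(w.shape)  # (8, 8)
```

Doubling the number of tokens quadruples the size of that weight matrix.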


Isn't that factorial, and much more costly than quadratic?


N choose 2 = N! / (2!(N-2)!) = N(N-1)/2, which is quadratic in N, not factorial.
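A quick check of that identity, using Python's stdlib `math.comb`: the factorials cancel, leaving a polynomial in N.

```python
from math import comb

# comb(n, 2) counts unordered token pairs; it equals n(n-1)/2,
# a quadratic function of n, even though it is written with factorials.
for n in (10, 100, 1000):
    assert comb(n, 2) == n * (n - 1) // 2

print(comb(1000, 2))  # 499500
```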


The way I understood it is that for each token, the attention mechanism itself consumes a fixed amount of processor time.

The innovation here is to prioritize tokens, so that some tokens get more processor time than others.
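One hypothetical way to picture that prioritization (this is a toy sketch, not the method the article describes): let each query keep only its top-k highest-scoring keys and renormalize. Note this toy version still builds the full n x n score matrix, so it only illustrates the selection idea, not the actual compute savings.

```python
import numpy as np

def topk_attention(q, k, v, kk=4):
    # Hypothetical sketch: each query attends only to its kk
    # highest-scoring keys; all other scores are masked to -inf
    # before the softmax, so they get zero weight.
    scores = q @ k.T / np.sqrt(q.shape[1])
    thresh = np.sort(scores, axis=1)[:, -kk][:, None]      # kk-th largest per row
    scores = np.where(scores >= thresh, scores, -np.inf)   # drop everything else
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
n, d = 16, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = topk_attention(q, k, v, kk=4)
print(out.shape)
```

Real sparse-attention methods avoid computing the full score matrix in the first place; the sketch above only shows what "some tokens get more attention budget than others" means.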



