Hacker News

I understand this is ELI5, but doesn’t attention already do this, in the way you described? It pays specific focus to the most contextual words in the prior sequence.


Not from a computational perspective. To calculate the attention scores you have to score every token against every other token, which is quadratic. Every article like "a", "an", or "the" still has to be scored against every other word, even though articles are only relevant within a short distance of the word they are attached to.
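For concreteness, here's a minimal NumPy sketch (not code from any particular model) showing where the quadratic cost comes from: the score matrix has one entry per (query, key) pair, so for n tokens it is n x n.

```python
import numpy as np

def attention_weights(q, k):
    # q, k: (n_tokens, d) arrays. The score matrix q @ k.T is
    # (n_tokens, n_tokens) -- every token scored against every other,
    # which is why the cost grows quadratically with sequence length.
    scores = q @ k.T / np.sqrt(q.shape[1])
    # row-wise softmax to turn scores into attention weights
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

n, d = 8, 4
rng = np.random.default_rng(0)
w = attention_weights(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(w.shape)  # (8, 8)
```

Doubling the number of tokens quadruples the size of that weight matrix.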


Isn't that factorial, and much more costly than quadratic?


N choose 2 = N! / (2!(N-2)!) = N(N-1)/2, which is quadratic in N, not factorial.
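A quick check of that identity, using Python's stdlib `math.comb`: the factorials cancel, leaving a polynomial in N.

```python
from math import comb

# comb(n, 2) counts unordered token pairs; it equals n(n-1)/2,
# a quadratic function of n, even though it is written with factorials.
for n in (10, 100, 1000):
    assert comb(n, 2) == n * (n - 1) // 2

print(comb(1000, 2))  # 499500
```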


The way I understood it is that for each token, the attention mechanism itself consumes a fixed amount of processor time.

The innovation here is to prioritize tokens, so that some tokens get more processor time than others.
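One hypothetical way to picture that prioritization (this is a toy sketch, not the method the article describes): let each query keep only its top-k highest-scoring keys and renormalize. Note this toy version still builds the full n x n score matrix, so it only illustrates the selection idea, not the actual compute savings.

```python
import numpy as np

def topk_attention(q, k, v, kk=4):
    # Hypothetical sketch: each query attends only to its kk
    # highest-scoring keys; all other scores are masked to -inf
    # before the softmax, so they get zero weight.
    scores = q @ k.T / np.sqrt(q.shape[1])
    thresh = np.sort(scores, axis=1)[:, -kk][:, None]      # kk-th largest per row
    scores = np.where(scores >= thresh, scores, -np.inf)   # drop everything else
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
n, d = 16, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = topk_attention(q, k, v, kk=4)
print(out.shape)
```

Real sparse-attention methods avoid computing the full score matrix in the first place; the sketch above only shows what "some tokens get more attention budget than others" means.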



