Are there any good papers on this you would suggest, or specific search terms?

I am vaguely aware of some of this, but would love to study it more. I don't quite understand what it is all about, though I do see how LLMs can attend to all prior tokens, so they don't have the single point of failure that HMMs do, which is what makes Viterbi decoding more necessary there.