Didn't thinking tokens resolve the most problematic part of autoregressive models (the first few tokens set the constraints the model can't overcome later) and give it a massive advantage compared to diffusion models by showing the thinking trace? I can see diffusion models being used as a draft model to quickly predict a bunch of tokens and let the autoregressive model decide to use them or throw them away quickly, speeding it up considerably while keeping thinking traces available.
The reason I mentioned "purely autoregressive" is that realistically I expect hybrid diffusion + autoregressive models to be the first popular diffusion models. I could be wrong though. And diffusion models have other tricks like really easy integration with simple classifiers.