Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:
That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).
If you can't trust a company, don't use their api or cloud services. No amount of external output will ever validate anything, ever. You never know what's really happening, just because you see some text they sent you.
Do you not see that the next (or previous) logical step would be a "commercial ban" of frontier models, all "distilled" from an enormous amount of copyrighted material?
That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.
All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
> For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.
From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.
It's contextual though, and pragmatic seems different to me than correct.
For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.
Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".
From my perspective, it seems viable usage. And I guess on wonders what the LLM means when using it that way. What makes it determine a compromise is required?
(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)
> It's contextual though, and pragmatic seems different to me than correct.
To me too, that's why I say they are measurements on different dimensions.
To my mind, I can draw a X/Y axis with "Pragmatic" on the Y and "Correctness" on the X, and any point on that chart would have an {X,Y} value, which is {Pragmatic, Correctness}.
If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.
Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.
I don't think it's a single axis even in the original poster's conception, since you could be both incorrect and also not pragmatic.
But if a fix needs to be described as pragmatic relative to the alternatives, that's probably because it couldn't be described as correct. Otherwise you wouldn't be talking about how pragmatic it is.
> also whenever "pre-existing issue" appears (it's never pre-existing)
I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified as pre-existing. It's definitely a problem if Claude breaks tests then claims the issue was pre-existing, but is that really what's happening?
I had some interesting experience to the opposite last night, one of my tests has been failing for a long time, something to do with dbus interacting with Qt segfaulting pytest. Been ignoring it for a long time, finally asked claude code to just remove the problematic test. Come back a few minutes later to find claude burning tokens repeatedly trying and failing to fix it. "Actually on second thought, it would be better to fix this test."
Match my vibes, claude. The application doesn't crash, so just delete that test!
I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.
It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.
What's the implication of this? That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?
But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."
I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?
So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?
>That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?
Yes it plans ahead, but with significant uncertainty until it actually outputs these tokens and converges on a definite trajectory, so it's not a useless filler - the closer it is to a given point, the more certain it is about it, kind of similar to what happens explicitly in diffusion models. And it's not all that happens, it's just one of many competing phenomena.
So like many of the promises from AI companies, reported chain of thought is not actually true (see results below). I suppose this is unsurprising given how they function.
Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?
People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.
‘After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.‘
I mean, obviously, it's not going to be a faithful representation of the actual thinking. The model isn't aware of how it thinks any more than you are aware how your neurons fire. But it does quantitatively improve performance on complex tasks.
As you can see from posts on this story, most people believe it reflects what the model is thinking and use it as a guide to that so they can ‘correct’ it. If it is not in fact chain of thought or thinking it should not be called that.
It is the same with human chain of thought, though. Both of them are post-hoc rationalisations justifying "gut feelings" that come from thought processes the human/agent doesn't have introspection into. And yet asking humans or machines to "think out loud" this way does increase the quality of their work.
I disagree - humans often reason in a series of steps, and can write these down before they've reached an answer. They don't always wait till they reach a conclusion (with no self-insight into how they did so) and then retrospectively generate a plausible answer as LLMs do.
In mathematical proofs they may guess and answer and then work out a proof, but that is a different process.
https://www.anthropic.com/research/reasoning-models-dont-say...