
I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.
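For anyone unsure where that goes, it's the user settings file (typically ~/.claude/settings.json), merged with whatever you already have:

    {
      "cleanupPeriodDays": 365
    }

And if you do still have old transcripts, here's a rough sketch of the kind of mining I mean. It assumes the current JSONL layout under ~/.claude/projects/ (one file per session, tool calls and thinking as "message.content" blocks, tools named "Read"/"Edit"/"MultiEdit"), so adjust the field names if your logs differ:

    #!/usr/bin/env python3
    # Per-session read:edit ratio and thinking volume from Claude Code logs.
    # Assumes ~/.claude/projects/<project>/<session>.jsonl transcripts.
    import json
    from collections import Counter
    from pathlib import Path

    for path in sorted((Path.home() / ".claude" / "projects").glob("*/*.jsonl")):
        tools = Counter()
        thinking_chars = 0
        for line in path.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            msg = entry.get("message")
            content = msg.get("content") if isinstance(msg, dict) else None
            if not isinstance(content, list):
                continue
            for block in content:
                if block.get("type") == "tool_use":
                    tools[block.get("name", "?")] += 1
                elif block.get("type") == "thinking":
                    thinking_chars += len(block.get("thinking", ""))
        reads, edits = tools["Read"], tools["Edit"] + tools["MultiEdit"]
        if reads or edits or thinking_chars:
            ratio = f"{reads / edits:.2f}" if edits else "inf"
            print(f"{path.name}: read:edit={ratio} thinking_chars={thinking_chars}")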

The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers redact or limit thinking, but some non-Anthropic ones do (mostly the ones not on API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk same as you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/



The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, it's so annoying. Unless it makes a change and then immediately runs tests and it's obvious from the name/contents that the failing test is directly related to the change that was made it will ignore it and not try to fix.


This problem has been around for a long time. Not only that, but it would say this even when the problems were directly caused by its own code.

I put a line in my CLAUDE.md that says "If a test doesn't pass, fix it regardless of whether it was pre-existing or in a different part of the code."
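In the file it's just a plain instruction; mine sits under a heading like this (the heading is arbitrary):

    ## Tests
    If a test doesn't pass, fix it regardless of whether it was
    pre-existing or in a different part of the code.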


This should be part of the system prompt. It's absolutely unacceptable to not at least try to investigate failures like this. I absolutely hate when it reaches this conclusion on its own and just continues on as if it's doing valid work.


Based on the recent leaks, their system prompt explicitly nudges the model not to do anything outside of what was asked. That could very well explain why it’s not fixing preexisting broken tests.

“Don't add features, refactor code, or make "improvements" beyond what was asked.”

https://www.dbreunig.com/2026/04/04/how-claude-code-builds-a...


And it's very valid. Because otherwise you would ask Claude to trim a tree and it would go raze the whole forest and plant new seeds. This was the primary pain point last year, especially with Sonnet.


Whatever prompting OpenAI uses for Codex / GPT 5.4 seems superior here, then.

It's very surgical and careful around incremental refactoring, etc., but it also doesn't avoid responsibility.


> "this test failure is preexisting so I'm going to ignore it"

Critical finding! You spotted the smoking gun!


I will note that this "out" Claude takes was a) less frequent with Opus 4.5 in that time frame, and b) notably not something Codex does.

I don't trust the code that Claude writes at all. If I have to use it (they gave me a free month recently, so I use it...), I not only review it carefully but also have Codex do a thorough review.

Claude "cheats" and leaves hacks and has Dunning-Kruger.

All of this is very exhausting. I am enjoying writing my own code with these tools (to get long running personal projects out the door) but the effect that these tools are having on teams is terrifyingly corrosive and it's making me want to take an early retirement from the profession.

Yes we can write a lot of code quickly. But at what cost? And what even use is all this code now anyways?


That said, I've worked with several humans who did/said the exact same thing.


But did they say that about tests they had just added themselves, too? I've had Claude try that on me a couple of times >_<


Usually these were the developers who said their code didn't need tests because it was obviously correct/too simple to need them. And then their bug caused a crash that needed to be fixed over the weekend :/


I can't believe that's where we're at as software devs. I miss predictable outputs, state machines. All those LLM (prompt)-based rules make no sense to me. Same with AI WAL. All of it, at some point, will fail.


I present a new name for this - FAKE CODE.

This is simply the next iteration of FAKE NEWS. We have been steadily democratizing and thus lowering the verification standards:

Verified News (AP/Reuters) --> Opinion pieces (Fox/CNN) --> Social media (TikTok/YouTube).

Verified Code --> Vibe Code

Democracy gave everyone a vote - was that a good thing?

Social media gave everyone a visual - was that a good thing?

AI gave everyone a vibe - was that a good thing?

The trust factor never went away. It just got dispersed and diluted.


> I can't believe that's where we're at, as software devs

Agree wholeheartedly.

The premise of the bug did not make any sense to me. For instance, "unusable for complex engineering tasks": why would someone who understands these tools use them for complex engineering tasks? Also, this phrase in the bug report reads as too jargony: "Extended Thinking Is Load-Bearing for Senior Engineering Workflows" - what does this even mean? Am I the only one looking at this with bewilderment? I think there is a group of folks producing almost-working proof-of-concept code with these tools who will face a reckoning at some point - as the bug illustrates. I watch this storm in a teacup with wonder and amusement.

There is also a larger commentary here: when you don't understand why things work (i.e., lack a causal model), you won't know why they broke (or be able to find root causes). We are at a point in our craft where we throw magic dust and chant spells at Claude and hope and pray it works.


Yeah that. After spending years trying to get reproducible builds, I now have a crazy moving target to deal with.


It's hard not to feel deeply depressed by it.

But we can't put the genie back in the bottle.


> is consumer-hostile thinking

I've been saying this to many of my friends, but I feel like it's also probably illegal: you paid for a subscription you expect X out of, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, was that not false advertising? Could we not ask for a refund, or even sue?


Depends on the terms and conditions


Where I live, the law is above some silly terms and conditions.


Contract law is law. But I know what you mean


Probably not. The engineers don't even know how these things work (see: black box), so how could you even prove that it's not doing what it's 'supposed' to be doing?


I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription results in significantly different levels of effective thinking effort compared to a Claude Code + API call?

Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.
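If it helps, here's a minimal sketch of the API side of such a comparison, using the Anthropic Python SDK's extended-thinking parameter. The model id, thinking budget, and prompt are placeholders; you'd run the same task in a subscription Claude Code session alongside it:

    # Count thinking volume for one prompt over the raw API.
    # Requires ANTHROPIC_API_KEY in the environment.
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},  # placeholder budget
        messages=[{"role": "user", "content": "<same prompt you gave Claude Code>"}],
    )
    thinking_chars = sum(
        len(b.thinking) for b in resp.content if b.type == "thinking"
    )
    print("thinking chars:", thinking_chars, "| output tokens:", resp.usage.output_tokens)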


GP already said they applied all those settings.


I wonder if they’ve had so many new signups lately that they just don’t have enough capacity, so they fiddled with the defaults so they could respond to everyone? Could it be as simple as that?


Thanks for your report.

> a silently-introduced limitation of the subscription plan

Is it a fact that API consumers aren't affected by this?

> if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.

Absolutely agreed.


Hello Claude.



