How do you guys manage regressions as a whole with every new model update? A mas...

bcherny · 2026-04-06T18:03:59 1775498639

A mix of evals and vibes.

efields · 2026-04-06T19:50:31 1775505031

"Evals and vibes" can I put that on a t shirt?

giwook · 2026-04-06T18:05:34 1775498734

What's that ratio exactly?

capnchaos · 2026-04-06T18:17:28 1775499448

Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.

cududa · 2026-04-06T22:16:57 1775513817

Remember when they shipped that version that didn't actually start/run? At work we were goofing on them a bit, until I said "Wait, how did their tests even run on that?" And we realized that whatever their CI/CD process is, it wasn't running on the actual release binary at the time... I imagine their departure from how most engineers think about CI/CD is indicative of some other patterns (or lack of traditional patterns).

As someone who used to work on Windows, I had a vision of an e2e testing harness similar in scope to the one for Windows Vista/7 (knowing about bugs/issues doesn't mean you can necessarily fix them, hence Vista then 7), and I assumed Anthropic must provide some Enterprise guarantee backed by that testing matrix. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.

Why not provide pinnable versions or something? This episode, and the 2 wasted months of suboptimal productivity, hits on the absurdity of constantly changing the user/system prompt and doing so much of the R&D and feature development in two brittle prompts with unclear interplay. So until there's something like a composable system/user prompt framework they reliably develop tests against, I personally would prefer pegged, selectable versions. But each version probably has known critical bugs they're dancing around, so there is no version they'd feel comfortable making a pegged stable release.

xnorswap · 2026-04-07T10:14:55 1775556895

That was actually an interesting case of something CI/CD doesn't tend to catch.

It failed to start because it failed to parse the published release notes.

In the CI/CD system it would have passed, because the release notes that broke it hadn't been published yet.

Those release notes also took down previous versions of claude-code, so rolling back didn't help users.

The breakage wasn't a change in the software, it was a change in the release notes which coincided with the change in the software.

Now, should it have been grabbing release notes and parsing them? No, that's unbelievably dumb (and potentially dangerous), but it wasn't an issue with missing CI/CD, but an interesting case-study in CI/CD gaps and how CI/CD can actually lead to over-confidence.
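
To make the gap concrete, here is a minimal sketch of the fail-soft approach, assuming the app fetches release notes as a JSON list at startup. The JSON shape and the names `parse_release_notes` and `start_app` are hypothetical; nothing here reflects how claude-code actually does it. The point is only that remote data published after the build should never be able to block startup:

```python
import json
from typing import Any


def parse_release_notes(raw: str) -> list[dict[str, Any]]:
    """Parse remotely fetched release notes without ever raising.

    The JSON shape (a list of objects with a "version" string) is a
    hypothetical stand-in; the real format is not described in this thread.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []  # malformed payload: degrade to "no notes"
    if not isinstance(data, list):
        return []  # valid JSON, but not the shape we expect
    # Keep only entries that look usable; silently drop the rest.
    return [
        entry
        for entry in data
        if isinstance(entry, dict) and isinstance(entry.get("version"), str)
    ]


def start_app(raw_release_notes: str) -> str:
    """Hypothetical startup path: release notes are optional decoration."""
    notes = parse_release_notes(raw_release_notes)
    return f"started ({len(notes)} release note(s) loaded)"
```

A CI test that feeds deliberately malformed payloads into `start_app` would cover this case even though the real release notes don't exist yet at build time.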

misnome · 2026-04-06T22:57:59 1775516279

About once a week I get a claude "auto update" that fails to start with some bun error on our Linux machines. It's beyond laughable.

try-working · 2026-04-06T20:44:45 1775508285

I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow