
Not sure, and not completely convinced of the explanation, but the way this sticks out so obviously makes it look like a honeypot to me.


Great theory. I'll dig deeper.


Claude Code has a server-side anti-distillation opt-in called fake_tools, but the local code does not show the actual mechanism.

The client sometimes sends anti_distillation: ['fake_tools'] in the request body at services/api/claude.ts:301

The client still sends its normal real tools: allTools at services/api/claude.ts:1711

If the model emits a tool name the client does not actually have, the client turns that into "No such tool available" errors at services/tools/StreamingToolExecutor.ts:77 and services/tools/toolExecution.ts:369
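A minimal sketch of the client-side behavior described above, assuming only what the decompiled paths suggest. The function and constant names here (buildRequestBody, executeToolCall, NO_SUCH_TOOL) are illustrative, not the actual Claude Code identifiers:

```typescript
// Illustrative sketch only; names are invented, not Claude Code's.
type Tool = { name: string; run: (input: unknown) => string };

const NO_SUCH_TOOL = "No such tool available";

// The request body carries both the opt-in flag and the normal,
// real tool list (per claude.ts:301 and claude.ts:1711 above).
function buildRequestBody(allTools: Tool[], optIn: boolean) {
  return {
    tools: allTools.map((t) => ({ name: t.name })),
    ...(optIn ? { anti_distillation: ["fake_tools"] } : {}),
  };
}

// If the model names a tool the client does not have, the client
// surfaces an error instead of executing anything.
function executeToolCall(allTools: Tool[], name: string, input: unknown): string {
  const tool = allTools.find((t) => t.name === name);
  if (!tool) return `${NO_SUCH_TOOL}: ${name}`;
  return tool.run(input);
}
```

The point of the sketch is the tension the parent describes: the opt-in flag and the real tool list travel together, and any server-injected decoy the model actually called would land in the error branch, visibly.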

If Anthropic were literally appending extra normal tool definitions to the live tool set, and Claude used them, that would be user-visible breakage.

That leaves a few more plausible possibilities:

fake_tools is just the name of the server-side experiment, but the implementation is subtler than “append fake tools to the real tool list.”

or

The server may inject tool-looking text into hidden prompt context, with separate hidden instructions not to call it.

or

The server may use decoys only in an internal representation that is useful for poisoning traces/training data but not exposed as real executable tools.
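The third possibility could be sketched like this, purely as a guess at the shape of such a mechanism. Everything here is hypothetical: the decoy names, the ledger, and the flagging function are invented for illustration, not derived from Anthropic's code:

```typescript
// Hypothetical sketch of possibility 3: decoy tool names live only in
// an internal server-side ledger, never in the executable tool list
// the client sends or sees. All names below are invented.
const realTools = ["read_file", "bash", "edit_file"];
const decoyTools = new Set(["sync_weights", "export_logits"]); // never executable

// A trace that calls a decoy suggests the caller learned the name from
// poisoned traces/training data rather than from the live tool list.
function flagDistillationSignal(calledTools: string[]): string[] {
  return calledTools.filter((name) => decoyTools.has(name));
}

// The client-facing tool list contains no decoys, so honest sessions
// cannot trip the flag or hit "No such tool available" errors.
function clientToolList(): string[] {
  return realTools.slice();
}
```

Under this reading there is no user-visible breakage, which would square with the parent's observation that appending decoys to the live tool set would be noticed.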


We do know that Anthropic has the ability to detect when their models are being distilled, so there could be some backend mechanism that needs to be tripped to observe certain behaviour. Not possible to confirm though.


Who's we, and how do you know this?


"We" can be used to refer to people in general, and we know because Anthropic published a post called "Detecting and preventing distillation attacks" a month ago, in which they called out three AI labs for large-scale distillation.

https://www.anthropic.com/news/detecting-and-preventing-dist...




