You can define them in a structured way that's not tied to a specific programming language. Imagine a test suite that's entirely YAML inputs and outputs, or JSON, or even CSV.
The key idea is to have one test suite/specification that multiple implementations in different languages can share.
What is the advantage of that over programming languages though? At some point you’re just creating a new specification language which needs to be learned. If an LLM can go from English spec to Python unit tests, why not just start with, or at least distribute, Python unit tests? A programming language will allow you to be significantly more correct and consistent than English.
Because if the tests are in Python the LLM still has to convert them from Python to Ruby or whatever, which leaves room for mistakes to creep in.
If the tests are in YAML it doesn't need to convert them at all. It can write a new test harness in the new language and run against those existing, deterministic tests.
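To make that concrete, here's a minimal sketch of what such a harness could look like. It uses JSON rather than YAML only so it runs on the standard library alone. The case format, the `slugify` function, and the `run_cases` helper are all made up for illustration, not taken from any real suite.

```python
import json

# Hypothetical language-neutral test cases. In practice these would live in
# a shared file that every implementation's harness reads unchanged.
CASES_JSON = """
[
  {"name": "simple", "function": "slugify", "input": "Hello World", "expected": "hello-world"},
  {"name": "trims", "function": "slugify", "input": "  Spaces  ", "expected": "spaces"}
]
"""


def slugify(text):
    # Toy implementation under test (an assumption for this example).
    return "-".join(text.lower().split())


# Maps the function names used in the data file to real callables.
IMPLEMENTATIONS = {"slugify": slugify}


def run_cases(cases_json, implementations):
    """Run every case and return a list of (name, expected, actual) failures."""
    failures = []
    for case in json.loads(cases_json):
        func = implementations[case["function"]]
        actual = func(case["input"])
        if actual != case["expected"]:
            failures.append((case["name"], case["expected"], actual))
    return failures


if __name__ == "__main__":
    for name, expected, actual in run_cases(CASES_JSON, IMPLEMENTATIONS):
        print(f"FAIL {name}: expected {expected!r}, got {actual!r}")
```

A Ruby or Go port would only need to rewrite the loop in `run_cases`; the case data itself stays byte-for-byte identical.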
My point is that to create a specification you need to use a formal language of some kind. In this example they created a new YAML-based specification language. Why do that instead of using a well-documented existing formal language the LLM knows well, like Python? The translation is either YAML -> new language or Python -> new language. The translation is happening in both cases.
The advantage I can think of is that it might be more human-readable, but Python is damn close to pseudocode. It’ll likely always be a bit annoying to write because it has to be a formal language.
When told "use red/green TDD to write code for this in Ruby", a coding agent like Claude Code will write a test harness in Ruby that loops through all of those YAML tests, run it and watch it fail, then write just enough Ruby that the tests pass.
Yea I guess we're having a definitional disagreement here. To be clear I think this is a good idea and the work you've done using tests from projects to have agents translate libraries is awesome.
But to me clearly that YAML snippet you provided is a specification which needs to be translated to Ruby as much as Python would. If the equivalent Python is:
The YAML is no more clear than the Python, nor closer to Ruby. Honestly I think it's less clear as a human reading it because it's hard to tell which function is being tested in context of a specific test case. I guess it's possible Claude is better at working with the YAML than the Python but that would be a coincidence I think.
I've been exploring this pattern recently too. Giving current coding agents an existing conformance or test suite and telling them to keep writing code until the tests pass is astonishingly effective.
I've now got a JavaScript interpreter and a WebAssembly runtime written in Python, built by Claude Code for web run from my phone.
I'm not crazy about the way this installs Deno as `/usr/local/bin/deno` (on Linux systems at least). I was hoping it would leave that executable tucked away in Python site-packages somewhere out of the way.
This shouldn't be possible as long as you get a wheel (try configuring the installer to require wheels). The download script in the source distribution can do what it wants, of course, and I agree that this isn't the greatest behaviour.
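For anyone who wants to try the wheels-only route, a rough sketch with pip: the `--only-binary :all:` option tells pip to refuse source distributions entirely. The helper name and the `some-package` placeholder are mine, and I haven't verified how this particular package behaves under it.

```python
import subprocess
import sys


def wheel_only_install_command(package):
    """Build a pip command that refuses to fall back to an sdist."""
    return [
        sys.executable, "-m", "pip", "install",
        "--only-binary", ":all:",  # never build from a source distribution
        package,
    ]


if __name__ == "__main__":
    # "some-package" is a placeholder, not a real recommendation.
    subprocess.run(wheel_only_install_command("some-package"), check=True)
```

If no compatible wheel exists for your platform, the install fails instead of silently running the sdist's download script.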
My guess is that they do this in order to put the binary in a specific location of a container, as part of their own build process. The ecosystem doesn't distinguish between "source distributions" intended to build on a user's vs. developer's machine.
> Creating any one single behavior in a computer system is almost always trivial for the experienced engineer. When the experienced engineer on your team says that something can’t be done easily, what they almost always mean is that the thing can’t be done easily in a way that is acceptable to the health of the product. Junior engineers tend not to have to consider this constraint.
I completely agree. One of the things that makes a senior engineer senior is the ability to design and implement code with the health of the overall system in mind.
This is really hard, especially since you simultaneously have to resist the temptation to build abstractions for a future that may not come to pass - sticking to the YAGNI principle https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it
There's also a "source" distribution that installs by downloading from the main project's GitHub release page. But that's presumably limited to the same platform support.
The yt-dlp project also raised concerns that the manylinux wheel incorrectly advertises older glibc support.
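For context on what "advertises" means here: a manylinux wheel's platform tag encodes the minimum glibc version it claims to work with. Below is a small sketch of decoding that, based on the PEP 600 `manylinux_X_Y_arch` form and the legacy aliases. The `glibc_floor` helper is hypothetical, and I don't know which exact tag the wheel in question uses.

```python
import re

# Legacy aliases and the glibc versions they correspond to.
LEGACY_ALIASES = {
    "manylinux1": (2, 5),
    "manylinux2010": (2, 12),
    "manylinux2014": (2, 17),
}


def glibc_floor(platform_tag):
    """Return the (major, minor) glibc version a manylinux tag claims, or None."""
    match = re.match(r"manylinux_(\d+)_(\d+)_", platform_tag)
    if match:
        return (int(match.group(1)), int(match.group(2)))
    for alias, version in LEGACY_ALIASES.items():
        if platform_tag.startswith(alias + "_"):
            return version
    return None  # not a manylinux tag


if __name__ == "__main__":
    print(glibc_floor("manylinux_2_17_x86_64"))
    print(glibc_floor("manylinux2014_aarch64"))
```

If the binary inside the wheel actually links against a newer glibc than the tag claims, installers will accept the wheel on systems where it then fails at runtime.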
... but I always bristle a bit at the "one-liner to run without installing" description. Sure, the ergonomics are great, but you do still have to download the whole thing, and it does create a temporary installation that is hard-linked from a cache folder that is basically itself an installation.
Sure, but as an end-user you don't have to think about installation at all. That's a huge win - mainly because it eliminates the "Did I install this already? Where did I put it? What's the command for doing that again?" mental overhead.
> AI coders keep saying they review all the code they push
Those tides have shifted over the past 6 weeks. I'm increasingly seeing serious, experienced engineers who are using AI to write code and are not reviewing every line of code that they push, because they've developed a level of trust in the output of Opus 4.5 that line-by-line reviews no longer feel necessary.
(I'm hesitant to admit it but I'm starting to join their ranks.)
It's a fast starting and fast pausing persistent VM, with a ton of built in developer tools (including a preconfigured Claude Code) and an extra JSON API for executing commands within it so you can treat it as a sandbox.
I'm really excited about https://sprites.dev/ - it hits two of my favourite problems at once:
1. Developer environment sandboxes. This is a cheap and convenient way to run Claude Code / Codex CLI / etc in YOLO mode in a persistent sandboxed VM with a restricted blast radius if something goes wrong.
2. Sandbox API. Fly now have a product that lets me make a simple JSON API call to run untrusted code in a new sandbox. There's even snapshotting support so I can roll back to a known state after running that code.
BTW Simon, I was super happy when I heard on Theo's podcast that he will be encouraging you to monetise your work more. I'm super appreciative of your work and I'm pretty convinced that the more you profit from it, the better the universe will be!!!