One of the teams I worked for at AWS used TLA+ on a new back-end feature. It was one where, if things went wrong, the consequences for customers would be extremely bad. The demands of operating at scale only added to the complexity. Paranoia was key.
They'd estimated something like an optimistic 4 months for the work.
Two of the more senior developers were assigned the task, and they had to learn TLA+ and formal methods from scratch. (One dev's main complaint was that he was forced to use Eclipse; I don't know whether integration with other IDEs has happened since then.)
It took them a little while to get up to speed, but they were soon writing models and catching bugs in the proposed designs, and they had all the bugs ironed out within a few weeks.
They then found that going from TLA+ to actual Java code was almost a "fill in the blanks" exercise. The formal model provided almost the exact structure they needed for their final code. Going from model to code took them very little time at all. It was in production, running in shadow mode after not even 2 months, and was fully live not long afterwards.
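As a hedged illustration of what "fill in the blanks" can mean here (this example is hypothetical, not from the actual project, and is sketched in Python rather than Java): a TLA+ action consists of a guard and a primed-variable update, and that maps almost mechanically onto a method that checks a precondition and applies the same state change. The `Acquire` action and the `LockServer` class below are both invented for illustration.

```python
# Hypothetical TLA+ action (shown as a comment) and a direct transcription.
#
#   Acquire(c) == /\ lock = NULL    \* guard: the lock is free
#                 /\ lock' = c      \* update: client c takes the lock
#
# The guard becomes an if-test; the primed update becomes an assignment.

NULL = None

class LockServer:
    def __init__(self):
        self.lock = NULL  # corresponds to the TLA+ variable `lock`

    def acquire(self, client):
        # Guard from the spec: only act when the lock is free.
        if self.lock is NULL:
            self.lock = client  # the primed update: lock' = c
            return True
        return False  # action not enabled; no state change

server = LockServer()
assert server.acquire("a") is True
assert server.acquire("b") is False  # guard fails: lock already held
```

The point is that once the model's actions are pinned down, the shape of the implementation is largely dictated by them.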
TLA+ was subsequently used for a number of other features added to the platform.
There had been general push-back among developers that learning TLA+ and doing formal verification would only slow them down, but we found quite the opposite: TLA+ actually sped up development, with the big bonus of higher confidence in correctness.
I sort of wrote a vim plugin, but it's nowhere near as sophisticated as the above three. I also recently made public a personal CLI tool (https://github.com/hwayne/tlacli) for running TLC on the command line. It's experimental but I find it really useful.
Not very comparable. TLA+ involves writing statements in a custom logic that extends standard propositional logic with modalities (akin to monads) supporting 'time' (or rather, state transitions in discrete steps, à la temporal logic) and non-determinism (which is reused to support refinement).
Nothing like this can be done in a standard programming language (even Scala or Haskell), and even in a dependently-typed language you would need to write some support code in order to embed this logic and work with it within the system. (It wouldn't be hard for sure, since even FOL gives you far more power than any (multi-)modal logic, but it's a different, perhaps more elegant take on things than just working directly within TLA).
Here's a presentation by Eric, a distinguished engineer, and Neha, a principal engineer, on analysing IAM policies (Zelkova?) and Network reachability (Tiros?): https://www.youtube.com/watch?v=x6wsTFnU3eY
Yes, since we published this (I'm one of the authors), the Automated Reasoning Group has done amazing work applying formal methods to a broad range of problems at AWS. We're also still using TLA+ in many places for the reasons the paper describes.
One aspect of TLA+ (and formal specifications in general) that I underappreciated when we wrote this was how useful they are as a communication tool. I've found well-commented TLA+ to be an excellent medium for reviewing and documenting protocols, and for having high-bandwidth conversations about them. Complex protocol interactions can be hard to describe clearly in text, but the mix of text and TLA+ (or PlusCal) seems to work well.
Does your team hire PhDs with formal verification backgrounds? If so, what is the interview process like? Do you make them go through the standard leetcode type interviews?
Questions I wondered about that weren't answered until the end of the paper:
Q: Do engineers actually implement AWS systems in TLA+?
A: No, TLA+ is called "exhaustively testable pseudo-code" internally, and is only used for designs.
Q: Which AWS systems use it?
A: The first was DynamoDB, followed by S3, and now there are 8 additional complex systems that use it for designs.
Q: Who uses TLA+ at Amazon?
A: All engineers, from entry level to principal.
I thought it was especially interesting that several engineers at AWS were effectively writing their own model checkers by writing programs that tried to brute force rare combinations of conditions that would lead to system failure. Obviously TLA+ does this better and faster so they switched to that.
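To make that concrete, a hand-rolled checker of this sort amounts to a breadth-first search over a system's reachable states, testing an invariant in each one. The sketch below is a hypothetical toy (not anything from AWS): two processes performing a non-atomic read-increment-write on a shared counter, with a naive invariant that the final count is 2. The classic lost-update interleaving violates it. TLC does essentially this job, far more efficiently and with a real specification language in front of it.

```python
from collections import deque

# Toy brute-force "model checker": BFS over reachable states, checking an
# invariant in every state. State = (counter, per-process registers, per-
# process program counters). Two processes each do: read counter into a
# register, then write register + 1 back.

def successors(state):
    counter, regs, pcs = state
    for p in (0, 1):
        if pcs[p] == "read":
            new_regs = list(regs); new_regs[p] = counter
            new_pcs = list(pcs); new_pcs[p] = "write"
            yield (counter, tuple(new_regs), tuple(new_pcs))
        elif pcs[p] == "write":
            new_pcs = list(pcs); new_pcs[p] = "done"
            yield (regs[p] + 1, regs, tuple(new_pcs))

def check(init, invariant):
    seen, frontier, violations = {init}, deque([init]), []
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            violations.append(s)
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return violations

init = (0, (0, 0), ("read", "read"))
# Naive invariant: once both processes are done, counter == 2.
bad = check(init, lambda s: s[2] != ("done", "done") or s[0] == 2)
# `bad` contains the lost-update states where both read 0 and both wrote 1.
```

The real tools people wrote were of course system-specific, but the shape — enumerate interleavings, check a property — is the same, which is why switching to TLC was a natural step.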
Practical TLA+ is indeed a great place to start, and Hillel Wayne's writing on formal methods is all worth reading.
Lamport's TLA+ book, "Specifying Systems", is available online (https://lamport.azurewebsites.net/tla/book.html). That one's worth a read too. Lamport's approach to teaching TLA+ is different, and probably more suited to the mathematically inclined.
If you like spelunking through history, I'd also recommend checking out some of Lamport's papers on TLA. I particularly like "What Good is Temporal Logic?" which lays out his early arguments about why temporal logic is useful for solving some classes of problems. The paper is available here: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
Does anyone have resources that talk about writing a TLA+ model for an existing software system as opposed to designing a new system using TLA+ as the design reference?
I introduced TLA+ to help understand underlying components in the OpenStack ecosystem and at small companies to solve thorny problems. It's well worth learning as it gives you a very different perspective on the design and construction of systems.
One thing I'm wondering is what kinds of bugs arise when you try to translate a model (described as "exhaustively tested pseudocode" elsewhere in this thread) into a real system. For example, is there an if-branch somewhere that gets missed, etc.?
I am the engineer who translated https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/... into Java (https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/...). The translation took me about half a day.

Later, while running scalability experiments, a correctness bug intermittently showed up, and I spent approximately two weeks finding the root cause. Those two weeks included fixing around a dozen bugs in code that was not specified in TLA+; however, none of those were correctness bugs, and they were unrelated to the one I was chasing. Eventually, I translated the TLA+ spec into input for Java Pathfinder, which found the cause immediately: the Java translation contained an off-by-one bug in a for loop (TLA+ is one-indexed while Java is zero-indexed).

I am still convinced that a code review would have found the bug, and that JPF was a very heavyweight substitute for a fresh pair of eyes. In other words, the bugs that get introduced during the translation phase are shallow and easily corrected.
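That class of off-by-one is easy to reproduce in miniature. The sketch below (in Python rather than Java, and entirely hypothetical — it is not the actual bug from that translation) shows how a too-literal transcription of a 1-indexed TLA+ loop into a 0-indexed language silently drops an element.

```python
# TLA+ sequences are 1-indexed: for a sequence s, the elements are
# s[1] .. s[Len(s)]. A spec fragment like
#
#   \A i \in 1..Len(s) : Process(s[i])
#
# transcribed too literally into a 0-indexed language misbehaves.

def literal_transcription(s):
    # Buggy: keeps the spec's loop bounds (1 .. Len(s)) but uses
    # 0-indexed access, so s[0] is silently skipped.
    out = []
    for i in range(1, len(s)):
        out.append(s[i])
    return out

def correct_translation(s):
    # Correct: shift the spec's 1..Len(s) down to 0..Len(s)-1.
    return [s[i] for i in range(len(s))]

seq = ["a", "b", "c"]
assert literal_transcription(seq) == ["b", "c"]        # first element lost
assert correct_translation(seq) == ["a", "b", "c"]
```

Exactly the kind of shallow bug a fresh pair of eyes (or a reviewer comparing the loop against the spec line) catches quickly.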
Note that no other bugs have been reported for the code since.
A TLA+ specification is generally a simplified model of a real system design. So issues can definitely arise when building the real thing, but model checking a TLA+ spec helps quickly surface bugs that would exist even in a simplified design. Just don't expect it to do more than that.