One of the teams I worked for at AWS used TLA+ on a new back-end feature. It was one where, if things went wrong, the consequences for customers would be extremely bad. The demands of operating at scale only added to the complexity. Paranoia was key.
They'd estimated something like an optimistic 4 months for the work.
Two of the more senior developers were assigned the task, and they had to learn TLA+ and formal methods from scratch. (One dev's main complaint was that he was forced to use Eclipse; I don't know whether integration with other IDEs has happened since then.)
It took them a little while to get up to speed, but they were soon writing models and catching bugs in the proposed designs, and they had all the bugs ironed out within a few weeks.
They then found that going from TLA+ to actual Java code was almost a "fill in the blanks" exercise. The formal model provided almost the exact structure they needed for their final code. Going from model to code took them very little time at all. It was in production, running in shadow mode after not even 2 months, and was fully live not long afterwards.
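As a hedged illustration of what "fill in the blanks" can mean here (this example is hypothetical, not from the actual project, and is sketched in Python rather than Java): a TLA+ action consists of a guard and a primed-variable update, and that maps almost mechanically onto a method that checks a precondition and applies the same state change. The `Acquire` action and the `LockServer` class below are both invented for illustration.

```python
# Hypothetical TLA+ action (shown as a comment) and a direct transcription.
#
#   Acquire(c) == /\ lock = NULL    \* guard: the lock is free
#                 /\ lock' = c      \* update: client c takes the lock
#
# The guard becomes an if-test; the primed update becomes an assignment.

NULL = None

class LockServer:
    def __init__(self):
        self.lock = NULL  # corresponds to the TLA+ variable `lock`

    def acquire(self, client):
        # Guard from the spec: only act when the lock is free.
        if self.lock is NULL:
            self.lock = client  # the primed update: lock' = c
            return True
        return False  # action not enabled; no state change

server = LockServer()
assert server.acquire("a") is True
assert server.acquire("b") is False  # guard fails: lock already held
```

The point is that once the model's actions are pinned down, the shape of the implementation is largely dictated by them.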
TLA+ was subsequently used for a number of other features added to the platform.
There had been general push-back among developers that learning TLA+ and doing formal verification would only slow them down, but we found quite the opposite: TLA+ actually sped up development, with the big bonus of higher confidence in correctness.
I sort of wrote a vim plugin, but it's nowhere near as sophisticated as the above three. I also recently made public a personal CLI tool (https://github.com/hwayne/tlacli) for running TLC on the command line. It's experimental but I find it really useful.
Not very comparable. TLA+ involves writing statements in a custom logic that extends standard propositional logic with modalities (akin to monads) supporting 'time' (or rather, state transitions in discrete steps, à la temporal logic) and non-determinism (which is reused to support refinement).
Nothing like this can be done in a standard programming language (even Scala or Haskell), and even in a dependently-typed language you would need to write some support code in order to embed this logic and work with it within the system. (It wouldn't be hard for sure, since even FOL gives you far more power than any (multi-)modal logic, but it's a different, perhaps more elegant take on things than just working directly within TLA).
Here's a presentation by Eric, a distinguished engineer, and Neha, a principal engineer, on analysing IAM policies (Zelkova?) and Network reachability (Tiros?): https://www.youtube.com/watch?v=x6wsTFnU3eY
Yes, since we published this (I'm one of the authors), the Automated Reasoning Group has done amazing work applying formal methods to a broad range of problems at AWS. We're also still using TLA+ in many places for the reasons the paper describes.
One aspect of TLA+ (and formal specifications in general) that I underappreciated when we wrote this was how useful they are as a communication tool. I've found well-commented TLA+ to be an excellent medium for reviewing and documenting protocols, and for having high-bandwidth conversations about them. Complex protocol interactions can be hard to describe clearly in text, but the mix of text and TLA+ (or PlusCal) seems to work well.
Does your team hire PhDs with formal verification backgrounds? If so, what is the interview process like? Do you make them go through the standard leetcode type interviews?
Questions I wondered about that weren't answered until the end of the paper:
Q: Do engineers actually implement AWS systems in TLA+?
A: No, TLA+ is called "exhaustively testable pseudo-code" internally, and is only used for designs.
Q: Which AWS systems use it?
A: The first was DynamoDB, followed by S3, and now there are 8 additional complex systems that use it for designs.
Q: Who uses TLA+ at Amazon?
A: All engineers, from entry level to principal.
I thought it was especially interesting that several engineers at AWS were effectively writing their own model checkers by writing programs that tried to brute force rare combinations of conditions that would lead to system failure. Obviously TLA+ does this better and faster so they switched to that.
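To make that concrete, a hand-rolled checker of this sort amounts to a breadth-first search over a system's reachable states, testing an invariant in each one. The sketch below is a hypothetical toy (not anything from AWS): two processes performing a non-atomic read-increment-write on a shared counter, with a naive invariant that the final count is 2. The classic lost-update interleaving violates it. TLC does essentially this job, far more efficiently and with a real specification language in front of it.

```python
from collections import deque

# Toy brute-force "model checker": BFS over reachable states, checking an
# invariant in every state. State = (counter, per-process registers, per-
# process program counters). Two processes each do: read counter into a
# register, then write register + 1 back.

def successors(state):
    counter, regs, pcs = state
    for p in (0, 1):
        if pcs[p] == "read":
            new_regs = list(regs); new_regs[p] = counter
            new_pcs = list(pcs); new_pcs[p] = "write"
            yield (counter, tuple(new_regs), tuple(new_pcs))
        elif pcs[p] == "write":
            new_pcs = list(pcs); new_pcs[p] = "done"
            yield (regs[p] + 1, regs, tuple(new_pcs))

def check(init, invariant):
    seen, frontier, violations = {init}, deque([init]), []
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            violations.append(s)
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return violations

init = (0, (0, 0), ("read", "read"))
# Naive invariant: once both processes are done, counter == 2.
bad = check(init, lambda s: s[2] != ("done", "done") or s[0] == 2)
# `bad` contains the lost-update states where both read 0 and both wrote 1.
```

The real tools people wrote were of course system-specific, but the shape — enumerate interleavings, check a property — is the same, which is why switching to TLC was a natural step.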
Practical TLA+ is indeed a great place to start, and Hillel Wayne's writing on formal methods is all worth reading.
Lamport's TLA+ book, "Specifying Systems", is available online (https://lamport.azurewebsites.net/tla/book.html). That one's worth a read too. Lamport's approach to teaching TLA+ is different, and probably more suited to the mathematically inclined.
If you like spelunking through history, I'd also recommend checking out some of Lamport's papers on TLA. I particularly like "What Good is Temporal Logic?" which lays out his early arguments about why temporal logic is useful for solving some classes of problems. The paper is available here: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
Does anyone have resources that talk about writing a TLA+ model for an existing software system as opposed to designing a new system using TLA+ as the design reference?
I introduced TLA+ to help understand underlying components in the OpenStack ecosystem and at small companies to solve thorny problems. It's well worth learning as it gives you a very different perspective on the design and construction of systems.
One thing I'm wondering is what kinds of bugs arise when you try to translate a model (described as "exhaustively tested pseudocode" elsewhere in this thread) into a real system. For example, is there an if-branch somewhere that gets missed, etc.?
I am the engineer who translated https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/... into Java (https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/...). The translation took me about half a day.

Later, while running scalability experiments, a correctness bug intermittently showed up, and I spent approximately two weeks finding the root cause. Those two weeks included fixing around a dozen bugs in code that was not specified in TLA+; however, none of those were correctness bugs, and they were unrelated to the one I was chasing. Eventually, I translated the TLA+ spec into input for Java Pathfinder, which found the cause immediately: the Java translation contained an off-by-one bug in a for loop (TLA+ is one-indexed while Java is zero-indexed).

I am still convinced that a code review would have found the bug, and that JPF was a very heavyweight substitute for a fresh pair of eyes. In other words, the bugs that get introduced during the translation phase are shallow and easily corrected.
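That class of off-by-one is easy to reproduce in miniature. The sketch below (in Python rather than Java, and entirely hypothetical — it is not the actual bug from that translation) shows how a too-literal transcription of a 1-indexed TLA+ loop into a 0-indexed language silently drops an element.

```python
# TLA+ sequences are 1-indexed: for a sequence s, the elements are
# s[1] .. s[Len(s)]. A spec fragment like
#
#   \A i \in 1..Len(s) : Process(s[i])
#
# transcribed too literally into a 0-indexed language misbehaves.

def literal_transcription(s):
    # Buggy: keeps the spec's loop bounds (1 .. Len(s)) but uses
    # 0-indexed access, so s[0] is silently skipped.
    out = []
    for i in range(1, len(s)):
        out.append(s[i])
    return out

def correct_translation(s):
    # Correct: shift the spec's 1..Len(s) down to 0..Len(s)-1.
    return [s[i] for i in range(len(s))]

seq = ["a", "b", "c"]
assert literal_transcription(seq) == ["b", "c"]        # first element lost
assert correct_translation(seq) == ["a", "b", "c"]
```

Exactly the kind of shallow bug a fresh pair of eyes (or a reviewer comparing the loop against the spec line) catches quickly.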
Note that no other bugs have been reported for the code since.
A TLA+ specification is generally a simplified model of a real system design. So issues can definitely arise when building the real thing, but model checking a TLA+ spec helps quickly surface bugs that would exist even in a simplified design. Just don't expect it to do more than that.