Hacker News new | ask | show | jobs
by nonsens3 3469 days ago
Sorry for the offtopic, but have you ever used TLA+ to verify a non-trivial piece of production software? I recently found out about it, and was wondering if there are good public use-case details on it! I know Amazon has used it for some of their production systems, such as S3, and DynamoDB, but couldn't find many details.
2 comments

This is the one paper I know of: http://research.microsoft.com/en-us/um/people/lamport/tla/am...

I know of some small scale use at Facebook (where I work), where it has been used to verify a non-trivial protocol under various combinations of failures.

I've been referred to this paper before when asking this question.

Whilst it does discuss that Amazon used TLA+ in nontrivial circumstances, it's still a long mental leap to me to get to "TLA+ for AWS S3 looked like this, and this is the sort of bug it helped quash".

It'd be great to see some of that sort of thing.

If you want to see what TLA+ specifications of distributed systems look like, there's this lecture series, each on a different distributed algorithm: https://github.com/tlaplus/DrTLAPlus

The kind of bugs you find when verifying distributed systems is, if this machine was in that state and that message was in transit, and then the other machine failed, then data was lost".

Basically, because formal verification of this kind ensures the specification is correct, it will find any kind of bug there is. But it's those bugs that are either a result of a severe flaw in the design, or bugs that would be hard to find in testing but their results may be serious, that make this worthwhile.

Thanks. This is also the paper I was referring to.
Yes (well, it's not in production just yet; it's a fairly complex distributed database, for which I verified consistency in the face of faults), and I gotta say -- I enjoyed every minute of it!

It has some objective qualities and some subjective ones. The objective ones (as confirmed by others) are that it's easy to learn and you can get productive very fast. I immediately started specifying the large project after two weeks of going through Lamport's tutorial[1]. The other objecive benefit is that it has a model checker. Deductively verifying such large specifications (while possible) is simply infeasible (or, rather, unaffordable and an overkill) -- regardless of the tool you use.

The subjective benefit is that the conceptual model quickly clicked for me, personally. I've never liked the aesthetics of modeling arbitrary computations as functions, and the TLA formalism handles sequential, concurrent and parallel computations in the same, extremely elegant way (and allows using the same proof technique regardless of the kind of algorithm, if you want to use the proof assistant). It made me understand what computation is, and how different computations relate to one another, a lot better. The concept -- without the technicalities of TLA+ -- is nicely overviewed in [2].

That said, formal methods are never easy because rigorous reasoning about complex algorithms isn't easy, but it beats hunting down bugs (especially in distributed systems) that are very hard to catch and may be catastrophic; it is also very satisfying. I've found TLA+ to add very little "accidental complexity" to this difficult problem.

[1]: https://research.microsoft.com/en-us/um/people/lamport/tla/h...

[2]: https://research.microsoft.com/en-us/um/people/lamport/pubs/...

Is the database and/or specification open source? A formal verification of a distributed system sounds very interesting.
No; not yet, at least. But this is a lecture series about using TLA+ for distributed algorithms: https://github.com/tlaplus/DrTLAPlus

There's also a Coq framework called Verdi (http://verdi.uwplse.org) for formalizing distributed systems, but I don't know much about it.

Nice, thanks!
Thanks a lot! I will try the `hyperbook` as a starting point.