Hacker News new | ask | show | jobs
by martincmartin 3475 days ago
This is the one paper I know of: http://research.microsoft.com/en-us/um/people/lamport/tla/am...

I know of some small scale use at Facebook (where I work), where it has been used to verify a non-trivial protocol under various combinations of failures.

2 comments

I've been referred to this paper before when asking this question.

Whilst it does discuss that Amazon used TLA+ in nontrivial circumstances, it's still a long mental leap to me to get to "TLA+ for AWS S3 looked like this, and this is the sort of bug it helped quash".

It'd be great to see some of that sort of thing.

If you want to see what TLA+ specifications of distributed systems look like, there's this lecture series, each on a different distributed algorithm: https://github.com/tlaplus/DrTLAPlus

The kind of bugs you find when verifying distributed systems is, if this machine was in that state and that message was in transit, and then the other machine failed, then data was lost".

Basically, because formal verification of this kind ensures the specification is correct, it will find any kind of bug there is. But it's those bugs that are either a result of a severe flaw in the design, or bugs that would be hard to find in testing but their results may be serious, that make this worthwhile.

Thanks. This is also the paper I was referring to.