|
|
|
|
|
by dharmashuklaMS
3330 days ago
|
|
Cosmos DB codebase consists of multi-million lines of C++ code. It is fully asynchronous and like all large scale (stateful) distributed systems, it is also extremely complex. The database engine of Cosmos DB (including the b-tree and the log structured storage engine) is fully latch free; the engine, resource governance subsystem, global distribution infrastructure, partition management etc -- are all deeply integrated. Each of these subsystems have extremely complex state machines which are hard to describe with the required degree of precision using the English language. Any correctness bug that gets introduced because of lack of precision, can potentially lead to data loss, corruption, partial or complete failures of the entire service. This is where TLA+ comes in. Background:
Dr. Leslie Lamport's work has been a constant source of inspiration for the Cosmos DB team. A few of the engineers on the team had learnt TLA+ initially on their own and started seeing its benefits. Subsequently, other members of the team started applying it as well. Leslie had also taught a fantastic class on TLA+ (across Microsoft), which engineers on the Cosmos DB team attended. It was a wonderful, once in a lifetime opportunity for the team to learn TLA+ from Leslie. To be clear, Leslie personally didn't write any of the TLA+ specs for Cosmos DB. It was Cosmos DB engineers who wrote the TLA+ specs to specify & verify the design (incl. consistency models). The net result is that TLA+ made Cosmos DB a more robust system, which is crucial to offer strict and comprehensive SLAs encompassing availability, consistency, throughput and latency at the 99th percentile. Further, writing TLA+ specs for the five consistency models which we have exposed (as well as many that we have experimented, internally) enabled us to precisely define the semantics for each of the consistency models. This in-turn enables developers building apps on top of Cosmos DB, to rely on the well-defined semantics. Hope this is helpful. |
|
Still hope you write a whitepaper. The AWS paper is super valuable when making the case for TLA+ use in industry.