by jeffbee 1601 days ago
I find these articles (and there are dozens of them) somewhat frustrating because they fundamentally misrepresent the role of time sync in distributed databases like Spanner. Spanner doesn't get correctness from tight synchronization, it gets low latency from that. Spanner's correctness comes from estimating the clock error of cluster participants and evicting them when necessary. You can have the latter property without the former, and having the former property without the latter is useless.

Spanner gets both correctness and low latency from tight synchronization. It does a COMMIT_WAIT, meaning it waits for the max clock skew to pass before acknowledging a commit. But the max clock skew without TrueTime will be around 500ms, which is impractical to wait out. So 7ms is a latency cost you can live with, while 500ms is not merely "higher latency" but unusable. And any other technique to drop latency (without HLC) will violate correctness.

Disclosure: co-founder/CTO of YugabyteDB project
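
A minimal Python sketch of the commit-wait idea from the comment above. `UncertainClock`, `TTInterval`, and `commit_wait` are hypothetical names for illustration, not Spanner's or TrueTime's actual API. The sketch assumes the clock can return an interval `[earliest, latest]` that is guaranteed to contain true time.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # true time is assumed to be >= earliest
    latest: float    # true time is assumed to be <= latest


class UncertainClock:
    """Hypothetical stand-in for a TrueTime-style clock (an assumption, not a real API)."""

    def __init__(self, max_error_s, source=time.monotonic):
        self.max_error_s = max_error_s  # assumed bound on clock error, in seconds
        self._source = source

    def now(self):
        t = self._source()
        return TTInterval(t - self.max_error_s, t + self.max_error_s)


def commit_wait(clock, sleep=time.sleep, poll_s=0.0005):
    """Pick a commit timestamp, then block until it is definitely in the past."""
    commit_ts = clock.now().latest
    # Wait until even the earliest possible "now" is later than commit_ts.
    while clock.now().earliest <= commit_ts:
        sleep(poll_s)
    return commit_ts
```

In this sketch the wait is roughly twice the assumed error bound, so a bound of a few milliseconds costs a few milliseconds per commit, while a bound of 500ms would stall every commit for about a second. Correctness depends on the bound actually holding, not on it being small.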

fwiw i think you're both saying something very similar. TrueTime has to be correct about max skew in order not to break the assumption Spanner is built upon. you could also use a looser time bound and still have correctness, but you'd end up with a database that is useless to most/all customers
Interesting. How does this differ from HLC in practice - in the article you say you use a 500ms max skew for HLC?
The difference is in how the regular path (exercised most of the time) behaves versus the edge case where there is a conflict (typically in larger clusters with a pathological access pattern). With TrueTime, the latency is always 7ms and there are no issues in the pathological cases. With HLC, the latency is lower in most cases but high in the pathological cases (where it can be 500ms); these should not matter for many use cases.
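
For readers unfamiliar with HLC, here is a minimal Python sketch of the update rules as published in the hybrid logical clock literature. It is not YugabyteDB's implementation; the class name, the `physical_now` callback, and the integer time units are assumptions for illustration. The clock pairs a physical component `l` with a logical counter `c`, so timestamps stay close to wall time but never go backwards when a message arrives from a node whose clock is ahead.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridLogicalClock:
    physical_now: Callable[[], int]  # hypothetical wall-clock source, e.g. microseconds
    l: int = 0  # highest physical time seen so far
    c: int = 0  # logical counter to order events that share the same l

    def tick(self) -> Tuple[int, int]:
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self.physical_now())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def observe(self, msg_l: int, msg_c: int) -> Tuple[int, int]:
        """Merge the timestamp carried by an incoming message."""
        prev = self.l
        self.l = max(prev, msg_l, self.physical_now())
        if self.l == prev and self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

No waiting happens in these rules, which is consistent with the lower common-case latency described above. The skew bound only comes into play when a read cannot tell whether a conflicting write falls inside its uncertainty window.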
What do you think of RIFL/TAPIR?
Thanks for pointing this out, was not aware of TAPIR. Will take a look, seems pretty interesting.