| HN Mirror

When I was at Google, you were not supposed to serve end users straight out of BigTable. You had to do extra work: request hedging against multiple replicas (Jeff Dean has mentioned this up in public many times, with numbers on long-tail latency impact), some in-memory caching if appropriate, etc. In other words, you had to protect end users from Bigtable: after all, its original target was the web crawling and indexing pipelines. The problem is that, as you both say, it works very well the vast majority of the time, so people tended to get spoiled and/or make assumptions. Which is why Spanner was created.

As to those hiccups, unless they last for minutes or hours, in which case you might have a case of data corruption (BT is paranoid and rereads data right after any kind of compaction), most of the time they might be explained by, in approximately increasing order of badness:

- an orderly tablet server restart, e.g. for a binary update or because a Borg machine is undergoing a kernel update

- a tablet server crash: a software crash or a hardware one (this is bad, because there's a timeout that needs to be hit before a new server can take over the shard. The BT paper has details about the recovery protocol.)

- heavy load on the master, while either of the previous two are happening

- I don't think any of the various types of compactions would normally block reads/writes, but with some abnormal traffic patterns you might be able to make the tablet server suffer

- slowness at the lower layer, GFS/Colossus (although it mitigates a bit against this by having two separate log files into which it can write)

- Chubby outage

- power outage affecting a good chunk of or the entire cluster