by jonomacd
2700 days ago
Yeah, it was quite frustrating trying to figure out what was going on. Until replication was released, this made it a real non-starter for a lot of use cases. With replication you can combat the problem, and it does give you great performance (when it isn't giving you random errors).

As for those hiccups: unless they last for minutes or hours, in which case you might be looking at data corruption (BT is paranoid and rereads data right after any kind of compaction), most of the time they can be explained by one of the following, in roughly increasing order of badness:
- an orderly tablet server restart, e.g. for a binary update or because a Borg machine is undergoing a kernel update
- a tablet server crash, either software or hardware (this is bad because a timeout has to expire before a new server can take over the shard; the BT paper has details about the recovery protocol)
- heavy load on the master while either of the previous two is happening
- I don't think any of the various types of compactions would normally block reads/writes, but with some abnormal traffic patterns you might be able to make the tablet server suffer
- slowness at the lower layer, GFS/Colossus (although BT mitigates this a bit by having two separate log files it can write to)
- Chubby outage
- power outage affecting a good chunk of or the entire cluster
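
Since most of the causes above are short-lived, a client can often ride them out by retrying with backoff and only escalating once the hiccup outlasts a deadline. Here is a minimal sketch of that idea in Python. It is not the Bigtable client API: `TransientError`, `call_with_backoff`, and all the default values are hypothetical names and numbers made up for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for whatever retryable error your client raises."""


def call_with_backoff(op, deadline_s=60.0, base_s=0.1, cap_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds, retrying on TransientError.

    Gives up and re-raises once deadline_s seconds have elapsed, so a hiccup
    that lasts minutes surfaces as an error instead of being retried forever.
    All defaults are illustrative assumptions, not recommended values.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except TransientError:
            elapsed = clock() - start
            if elapsed >= deadline_s:
                raise
            # Exponential backoff with full jitter, capped at cap_s.
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            # Never sleep past the deadline.
            sleep(min(delay, deadline_s - elapsed))
            attempt += 1
```

You would wrap a single read or write in it, e.g. `call_with_backoff(lambda: read_row(key))`, where `read_row` is whatever your own client call is. The jitter is there so that many clients hitting the same restarting tablet server don't all retry in lockstep.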