| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by martincmartin 2836 days ago

I worked on LogDevice at FB until about 6 months ago.

I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.

So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.

1 comments

ivankelly 2835 days ago

What happens with a sequencer which appears to fail but hasn't really, and then comes back up after the process to figure out inflight records has completed? If that sequencer receives a record for a log, will it be able to write it to the storage nodes? I.e. is there any fencing mechanism to tell the storage nodes that the epoch has been bumped, so don't access writes for that epoch anymore?

link

sh00s 2835 days ago

yes, in LogDevice it's called "sealing". However, as it stands, a newly activated sequencer won't wait for sealing on the old epoch to complete before taking new writes - in the tradeoff between write availability and consistency LogDevice picks higher availability. Blocking new writes until sealing is complete, however, should be fairly easy to integrate into LogDevice as an option.

link

ivankelly 2832 days ago

Does it block reads until sealing is complete? How many nodes in the nodeset have to respond before sealing is complete? [NodeSet] - [ReplicationFactor] + 1?

link

sh00s 2832 days ago

Yes, reads are not released (i.e. are blocked) until sealing is complete. We call the minimal set of nodes sufficient to serve reads for a log (the same set is needed for sealing to complete) an f-majority.

For a simple case where placement of data is location-agnostic, indeed the definition of f-majority is n - r + 1, where n is the nodeset size, and r is the replication factor.

However, if your replication property, is say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. having all nodes in the nodeset respond minus two racks will also satisfy it.

link

ivankelly 2832 days ago

> Yes, reads are not released (i.e. are blocked) until sealing is complete.

Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.

> However, if your replication property, is say, "place 3 copies across 3 racks", then the definition of f-majority becomes more complicated - e.g. having all nodes in the nodeset respond minus two racks will also satisfy it.

Sure, the aim being that no write can be successfully acknowledged by enough replicas to complete the write.

link

sh00s 2832 days ago

> Which are the cases where consistency is compromised then? If a client of the log needs consistency, it needs to ensure that it has seen all previous updates to a log before making a new update, which implies a read.

Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, and there would be a breakage of the total ordering guarantee.

link

ivankelly 2832 days ago

> Consistency in a more general sense than just read-modify-write consistency. If you have sequencers active in several epochs at the same time accepting writes, the records may end up being written out of order, and there would be a breakage of the total ordering guarantee.

But given that reads are blocked on all sequencers before the current one, this should still provide total order atomic broadcast, unless a single client can connect to a sequencer with a lower epoch than one it has already seen.

link