Hacker News new | ask | show | jobs
by thinkersilver 2838 days ago
The use cases overlap neatly with Kafka's. Everything from it's usage of zookeeper, time-and-storage-based retention tuning are similar

The announcement does not clarify the reason they use this over kafka. Is it because Kafka doesn't scale to millions of logs on a single cluster or is it because kafka is not sympathetic to heterogeneous disk arrays containing SSD and HDD. I strongly suspect it may be latency of writes at scale but this is pure speculation.

I don't know. If I understand why anyone might use this I'd contribute to building language bindings for the APIs.

2 comments

Some strengths of LogDevice include:

- It's designed to work with a large number of logs (roughly equivalent to partitions in Kafka), hundreds of thousands per cluster is common.

- Sequencer failover is very quick, typical failover time when a sequencer node fails is less than a second.

- It supports location awareness and can place data according to replication constraints specified (e.g. replicate it in 3 copies across 2 different regions and 3 racks).

- Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.

- If a node/shard fails, it detects the failure and rebuilds the data that was replicated to failed nodes/shards automatically

> Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.

I am happy to expand more on this point.

We have this concept of "node set" of a log which is the set of storage nodes available to receive record copies sent by the sequencer. It is typically made of 20-30 nodes in typical deployments at Facebook. Write availability is maintained as long as enough storage nodes in the node set are available to accept copies. When storage node failures are detected, the sequencer can just exclude these nodes from the list of potential recipients for new records. It does not need to update a view that needs to be synchronized with readers, which is a heavy-weight operation. This model allows preserving high write availability even if many nodes in the node set are unhealthy.

Additionally, this record copy placement flexibility allows the sequencer to quickly route around latency spikes on individual storage nodes, which helps guarantee low append latency.

> Is it because Kafka doesn't scale to millions of logs on a single cluster

I doubt that's it, since Kafka can certainly do that.

Millions of separate topics on a single Kafka cluster? The way it's designed requires opening files for all of those topics and their partitions so good luck if you're trying that. You'll run out of file handles, then memory, and then the disk access will completely freeze up.
I didn't think we were speaking of millions of topics here; only millions of logs. You can certainly have logs numbering in the millions using a single topic. Mux/demux would have to happen at the producer/consumer side, of course.
Do you mean log segments then? In that case I don't see what's special about it because that's just rolling files and all of these systems can handle millions that way.

As far as millions of topics, if you have to do it at a logical layer yourself, then you might as well use a system that supports it natively.

The logs in LogDevice also have an independent lifecycle, which your solution doesn't allow.
A log in LogDevice is roughly equivalent to a Kafka partition.
It does not, I've lost alot of time profiling Kafka perf issues against clusters on the exact same hardware with exact same traffic but with a 3000% throughput difference. The root cause was one cluster had a lot of empty test topics

Try benchmarking Kafka from 0 partitions to a few thousand partitions in 100 partition increments. The benchmark only needs to write to a single topic, using their provided producer perf tool while all other topics are inactive with zero data.

As the partitions increase there is a very noticeable drop in throughout that looks to be linear.

Kafka does not handle a large number of partitions well currently, large even being low thousands. It's easy to hit with just a few hundred topics.

Reading between the lines ehen Linkdin and Netflix advertise several clusters, i am predicting/guessing they shard the data.

I didn't think we were speaking of millions of topics or partitions here; only millions of logs. You can certainly have logs numbering in the millions using a single topic. Mux/demux would have to happen at the producer/consumer side, of course.