HN Mirror



	by otterley 2840 days ago
	> Is it because Kafka doesn't scale to millions of logs on a single cluster?

	I doubt that's it, since Kafka can certainly do that.

2 comments

manigandham 2839 days ago

Millions of separate topics on a single Kafka cluster? The way it's designed requires opening files for all of those topics and their partitions so good luck if you're trying that. You'll run out of file handles, then memory, and then the disk access will completely freeze up.


otterley 2839 days ago

I didn't think we were speaking of millions of topics here; only millions of logs. You can certainly have logs numbering in the millions using a single topic. Mux/demux would have to happen at the producer/consumer side, of course.
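To make the mux/demux idea concrete, here is a minimal sketch, assuming each logical log is identified by a string id that doubles as the record key. The names `partition_for`, `mux`, and `demux` are hypothetical, an in-memory list stands in for the topic, and CRC32 is only a stand-in for whatever partitioner a real producer would use.

```python
import json
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # hypothetical partition count for the single shared topic


def partition_for(log_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash so every record of one logical log lands in the same partition."""
    return zlib.crc32(log_id.encode("utf-8")) % num_partitions


def mux(log_id: str, payload: str):
    """Producer side: tag the record with its logical log id, used as the key."""
    key = log_id.encode("utf-8")
    value = json.dumps({"log": log_id, "data": payload}).encode("utf-8")
    return partition_for(log_id), key, value


def demux(records):
    """Consumer side: regroup records read from the shared topic by log id."""
    logs = defaultdict(list)
    for _partition, key, value in records:
        logs[key.decode("utf-8")].append(json.loads(value)["data"])
    return dict(logs)


# An in-memory list stands in for the topic; no broker is involved.
topic = [mux("user-1", "a"), mux("user-2", "b"), mux("user-1", "c")]
result = demux(topic)
```

Records for one logical log keep their relative order only because they share a partition. Retention and deletion still apply to the topic as a whole, not to each logical log.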


manigandham 2838 days ago

Do you mean log segments then? In that case I don't see what's special about it because that's just rolling files and all of these systems can handle millions that way.

As far as millions of topics, if you have to do it at a logical layer yourself, then you might as well use a system that supports it natively.


rakoo 2838 days ago

The logs in LogDevice also have an independent lifecycle, which your solution doesn't allow.


sh00s 2838 days ago

A log in LogDevice is roughly equivalent to a Kafka partition.


beepbeepbeep1 2839 days ago

It does not. I've lost a lot of time profiling Kafka perf issues against clusters on the exact same hardware with the exact same traffic but with a 3000% throughput difference. The root cause was that one cluster had a lot of empty test topics.

Try benchmarking Kafka from 0 partitions to a few thousand partitions in 100 partition increments. The benchmark only needs to write to a single topic, using their provided producer perf tool while all other topics are inactive with zero data.

As the partitions increase there is a very noticeable drop in throughput that looks to be linear.

Kafka currently does not handle a large number of partitions well, with "large" meaning even the low thousands. It's easy to hit with just a few hundred topics.

Reading between the lines when LinkedIn and Netflix advertise several clusters, I am guessing they shard the data.
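The partition sweep suggested above could be scripted roughly as follows. This is a sketch under assumptions: "0 partitions" is read as zero idle partitions alongside one active topic, and the topic names (`bench`, `idle-N`), broker address, record count, and record size are all hypothetical. The script only builds the command lines for the stock `kafka-topics.sh` and `kafka-producer-perf-test.sh` tools and does not run them. Flags differ between Kafka versions (older releases take `--zookeeper` rather than `--bootstrap-server` for topic creation), so check them against your install.

```python
def benchmark_plan(max_partitions=3000, step=100, bootstrap="localhost:9092"):
    """Yield (idle_partition_total, commands) for each step of the sweep."""
    for total in range(0, max_partitions + 1, step):
        commands = []
        if total > 0:
            # Add another empty topic so the idle partition count grows by `step`.
            commands.append([
                "kafka-topics.sh", "--bootstrap-server", bootstrap,
                "--create", "--topic", f"idle-{total}",
                "--partitions", str(step), "--replication-factor", "1",
            ])
        # Always write to the same single active topic (assumed to exist already).
        commands.append([
            "kafka-producer-perf-test.sh", "--topic", "bench",
            "--num-records", "1000000", "--record-size", "100",
            "--throughput", "-1",
            "--producer-props", f"bootstrap.servers={bootstrap}",
        ])
        yield total, commands


if __name__ == "__main__":
    for total, commands in benchmark_plan(max_partitions=300):
        print(f"# idle partitions: {total}")
        for cmd in commands:
            print(" ".join(cmd))
```

Recording the throughput reported by the perf tool at each step and plotting it against the idle partition count would show whether the drop is linear.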


otterley 2839 days ago

I didn't think we were speaking of millions of topics or partitions here; only millions of logs. You can certainly have logs numbering in the millions using a single topic. Mux/demux would have to happen at the producer/consumer side, of course.
