| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by EdwardDiego 1903 days ago

Okay, I can see that point, but is it worth the additional latency between broker and Bookie?

> I'm pretty surprised Twitter didn't see benefit from doing this if they have multiple Kafka clusters with different use cases.

Yeah, I think they were too tbh. I wish I could delve more into what they experienced beyond that single blog post I linked.

2 comments

rad_gruchalski 1902 days ago

> Okay, I can see that point, but is it worth the additional latency between broker and Bookie?

It might depend on what you're ingesting and how much. Being able to independently scale ingest and storage is a good alternative to have. It's not only ingest though. It's also consumption. As it stands, having a parallel consumer over a large partition spanning several GBs also requires tons of RAM because a segment must be loaded into memory. In that sense, reprocessing historical data is pretty difficult. There's a lot of complexity hidden in additional Druid, HDFS installations or shoe-horned object storage with their own indexing to support access to historical data semi-fast.

link

EdwardDiego 1902 days ago

> having a parallel consumer over a large partition spanning several GBs also requires tons of RAM because a segment must be loaded into memory.

I don't know too much about Kafka's internals, but that's my not experience of reading several terabytes of data from a Kafka topic. Memory didn't blow out, although we did burn through IOPs credits.

link

rad_gruchalski 1902 days ago

Good remark. When there's one consumer or multiple consumers hanging on roughly to the same offset (using the same memory mapped segment). Having many consumers hanging on widely different offsets will cause many segments sitting in RAM.

Edit: Kafka apparently does not store a complete log segment in memory, only parts but having many consumers may lead to a lot of churn or a lot of memory consumed. Maybe this is getting better.

link

skyde 1902 days ago

Latency is actually reduced compared to kafka because when you have 3 replica you just need to wait for a response of the fastest of the 3 replica instead of waiting for a response of the slowest (master) of the 3 replica.

link