Hacker News new | ask | show | jobs
by ryanworl 2966 days ago
One potential problem is a Kafka partition’s size is limited to the size of the smallest machine in the replica set. This means if you want infinite retention you have to potentially over-partition so they never get too big, keep buying bigger machines and disks, or deal with a repartition of all data.

An simple way to get around this problem is dumping messages into a file and putting that file in S3 named something like “topic-partition-offset” where offset is the offset of the first message contained within that file. You can then read those forward starting from offset zero and go until you reach the end, then start reading from Kafka for recent data.

The drawback is this isn’t integrated with Kafka so you’re now maintaining what is effectively two different systems for the same data. It also means the key-based compaction won’t work either and you’d have to re-implement that on top of the files in S3 as well.

2 comments

I've used LVM in AWS to present a single volume to Kafka >16TB which is their max EBS size. They report you can attach up to 40 EBS volumes.

Growing LVM with XFSs has worked well, 0 downtime and around 60 seconds.

Allows you to over provision just enough you do not have to babysit the drives or pay $$$ for unused disc.

If you stripe the volumes you'll also distribute your IOPs in AWS.

Outside AWS LVM still applies. Kafka's JBOD is useless without easy / auto rebalancing.

This week onsite at a client's I discovered ScaleIO which can present up to a 1PB volume and does clever sharding/replication in the background.

> or deal with a repartition of all data.

Why is this difficult? Is mirroring clusters operationally problematic? If one cluster gets too small, in theory can’t you spin up another cluster, and mirror the first onto the second. Then when they are in sync direct writes to the new cluster?

That sounds possible, but it would both involve downtime or potentially ordering and data duplication issues if you mess it up. Dynamic expansion and contraction of partition count should be a feature that doesn’t require recreating the entire cluster like essentially every other data product in the world.
There's no downtime in mirroring a cluster with MirrorMaker, and it's become common to isolate producers from consumers to separate clusters, see https://medium.com/netflix-techblog/kafka-inside-keystone-pi...

Ordering of messages is completely unaffected, it's the routing of future events that's affected when you increase partition count. This is critical for some use cases (windowing of data for analytics purposes, for example) but irrelevant to others.

Data duplication issues? This sounds like FUD also, but common guidance is to design your events to be idempotent, or utilize Kafka's new exactly-once delivery.

You can expand partition counts for a topic dynamically.

You can't currently decrease partition counts, because given the current design that could orphan both data and consumers.

The architectural pendulum is starting to swing away from co-location of storage and compute (the trend of the last 10+ years) to decoupling of storage and processing to avoid exactly these issues, but legacy architectures hang on for a while.

In the streaming and messaging space, Apache Pulsar (pulsar.apache.org) is a more recent solution that has an architecture that decouples processing and storage. That gives you nice properties like independent scaling of storage and processing, infinite data retention, dynamic resizing and others.

What pendulum do you see? From here, architectural patterns are clearly converging on a "distributed mainframe" model between containerization and lambda/kappa architectures...

I think Joyent's Manta was ahead of its time in colocating compute and storage and I suspect we'll see more along this vein with the recent open sourcing of FoundationDB.

I was thinking more specifically of the internal architectures of data processing platforms, especially the categorizations that emerged from the MPP database world. The "shared nothing" architecture has been dominant in databases (and is also the core architecture of Hadoop), designed around "co-locating data and compute". Kafka largely follows that architecture as well, using local disk on the compute nodes as its persistent storage layer.

A lot of new data processing platforms, from Snowflake in the data warehouse world to AWS Athena to Apache Pulsar in the broader data processing world, have moved to decoupled architectures.

Containerization and container management frameworks (e.g. Kubernetes) certainly do change the meaning of "local" storage, will be interesting to see how that plays out.

You can but even in TB's of data never mind PB's it can take weeks to sync across.

Also running two clusters to handle large volumes of data, it's big money. Even a small modest cluster with around 20TB+ of data was north of 30K a month on AWS. That's a full app cluster though with consumers/producers aswell as brokers.