Hacker News new | ask | show | jobs
by linkmotif 2966 days ago
> or deal with a repartition of all data.

Why is this difficult? Is mirroring clusters operationally problematic? If one cluster gets too small, in theory can’t you spin up another cluster, and mirror the first onto the second. Then when they are in sync direct writes to the new cluster?

2 comments

That sounds possible, but it would both involve downtime or potentially ordering and data duplication issues if you mess it up. Dynamic expansion and contraction of partition count should be a feature that doesn’t require recreating the entire cluster like essentially every other data product in the world.
There's no downtime in mirroring a cluster with MirrorMaker, and it's become common to isolate producers from consumers to separate clusters, see https://medium.com/netflix-techblog/kafka-inside-keystone-pi...

Ordering of messages is completely unaffected, it's the routing of future events that's affected when you increase partition count. This is critical for some use cases (windowing of data for analytics purposes, for example) but irrelevant to others.

Data duplication issues? This sounds like FUD also, but common guidance is to design your events to be idempotent, or utilize Kafka's new exactly-once delivery.

You can expand partition counts for a topic dynamically.

You can't currently decrease partition counts, because given the current design that could orphan both data and consumers.

The architectural pendulum is starting to swing away from co-location of storage and compute (the trend of the last 10+ years) to decoupling of storage and processing to avoid exactly these issues, but legacy architectures hang on for a while.

In the streaming and messaging space, Apache Pulsar (pulsar.apache.org) is a more recent solution that has an architecture that decouples processing and storage. That gives you nice properties like independent scaling of storage and processing, infinite data retention, dynamic resizing and others.

What pendulum do you see? From here, architectural patterns are clearly converging on a "distributed mainframe" model between containerization and lambda/kappa architectures...

I think Joyent's Manta was ahead of its time in colocating compute and storage and I suspect we'll see more along this vein with the recent open sourcing of FoundationDB.

I was thinking more specifically of the internal architectures of data processing platforms, especially the categorizations that emerged from the MPP database world. The "shared nothing" architecture has been dominant in databases (and is also the core architecture of Hadoop), designed around "co-locating data and compute". Kafka largely follows that architecture as well, using local disk on the compute nodes as its persistent storage layer.

A lot of new data processing platforms, from Snowflake in the data warehouse world to AWS Athena to Apache Pulsar in the broader data processing world, have moved to decoupled architectures.

Containerization and container management frameworks (e.g. Kubernetes) certainly do change the meaning of "local" storage, will be interesting to see how that plays out.

You can but even in TB's of data never mind PB's it can take weeks to sync across.

Also running two clusters to handle large volumes of data, it's big money. Even a small modest cluster with around 20TB+ of data was north of 30K a month on AWS. That's a full app cluster though with consumers/producers aswell as brokers.