by taywrobel 1187 days ago
One issue I’ve encountered is over-partitioning to handle a spike in traffic.

I.e., an event occurs that causes an order of magnitude more messages than usual to be produced for a couple of hours, and because ingest and processing rates are out of balance, a backlog forms. Management wants things back in sync ASAP, and so greenlights increasing the partition count on the topic, usually doubling it.
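
To put rough numbers on that backlog, here's a back-of-the-envelope sketch in Python. Every figure is made up for illustration, and `drain_hours` is a hypothetical helper, not anything from Kafka. It also assumes that doubling partitions (and the consumers behind them) doubles processing capacity for the whole spike, which is optimistic.

```python
def drain_hours(normal_rate, capacity, spike_factor, spike_hours):
    """Hours needed after the spike ends to clear the backlog it built up.

    Rates are in messages per hour. All inputs are illustrative assumptions.
    """
    spike_rate = normal_rate * spike_factor
    # Backlog grows at (incoming - capacity) for as long as the spike lasts.
    backlog = max(0.0, (spike_rate - capacity) * spike_hours)
    # After the spike, only the headroom above normal traffic drains backlog.
    headroom = capacity - normal_rate
    if headroom <= 0:
        return float("inf") if backlog > 0 else 0.0
    return backlog / headroom


# Assumed: 1M msgs/hour normally, consumers sized for 1.5M, 10x spike for 2 hours.
as_is = drain_hours(1_000_000, 1_500_000, 10, 2)
# Assumed: doubling partitions and consumers doubles capacity to 3M msgs/hour.
doubled = drain_hours(1_000_000, 3_000_000, 10, 2)
print(as_is, doubled)
```

With those assumed numbers the backlog takes well over a day to drain as-is versus a few hours with doubled capacity, which is why doubling looks so attractive in the moment.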

In an event-driven architecture that is fairly well tuned for normal traffic, the spike can have the same effect on downstream topics, and those topics get their partition counts raised in response as well.

Once the anomalous traffic subsides, teams go to turn down the now over-partitioned topics, only to learn that increasing the partition count was a one-way operation: Kafka doesn’t support reducing the number of partitions on an existing topic, so now they’re stuck with that many partitions and the associated cost overhead.

Also, if I see another team try to implement “retries” or delayed processing on messages by doing some weird multi-topic trickery, I’m going to lose my mind. Kafka is a message queue, not a job queue, and not nearly enough engineers seem to grok that.