by lizthegrey 1662 days ago
1.5M messages/sec, average message size 1 KB pre-compression, 300 bytes post-compression/batching.

the problem was that we were really, really disk-limited before: to keep the 48-hour window of data we had to keep everything on NVMe or EBS, which was astoundingly expensive.

but yeah, we run it all off 6 brokers now.
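
For a rough sense of scale, the numbers above can be turned into a back-of-the-envelope retention estimate. This is only a sketch: the replication factor of 3 is an assumption (it is not stated in the thread), and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of how much data a 48-hour window holds.
# Figures from the comment: 1.5M messages/sec, ~300 bytes each after compression.
# ASSUMPTION: replication factor 3 is a common Kafka setting, not stated in the thread.

def retention_bytes(msgs_per_sec: float, bytes_per_msg: float,
                    window_hours: float, replication_factor: int = 1) -> float:
    """Bytes stored for one retention window, including replicas."""
    seconds = window_hours * 3600
    return msgs_per_sec * bytes_per_msg * seconds * replication_factor

ingest_mb_per_sec = 1_500_000 * 300 / 1e6  # compressed ingest rate in MB/s
one_copy_tb = retention_bytes(1_500_000, 300, 48) / 1e12
three_copies_tb = retention_bytes(1_500_000, 300, 48, replication_factor=3) / 1e12

print(f"ingest: {ingest_mb_per_sec:.0f} MB/s")
print(f"48h window, one copy: {one_copy_tb:.2f} TB")
print(f"48h window, 3 replicas (assumed): {three_copies_tb:.2f} TB")
```

That works out to roughly 78 TB per copy of the window before any replication, which is the volume that would otherwise have to sit on NVMe or EBS.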


If I understand correctly, there were:

- issues with tail latency and cost when using gp2

- issues with generally bad performance when using st1

- issues with reliability when using gp3 (as an early adopter of an AWS "GA" product)

- issues with insufficient disk space when using locally attached NVMe

- issues with Confluent licensing cost

And tiered storage solves all of that.

The thing is, I have not seen Kafka struggling with disk performance when running on GCP pd-ssd. Perhaps even pd-balanced would do the trick, as indicated by rmb938's comment. I am glad that you guys finally landed on a boring solution, but things have been rather boring for years with another cloud provider. Perhaps there is no material impact from the high tail latency when using gp2, and you just needed a better contract negotiator? Surely the tail latency would be worse now whenever data needs to be pulled from S3?

Oh, believe me, we have hired Corey Quinn (Duckbill Group). AWS budged on some things, but not on the EBS cost.

Maybe it's worth trying out GCP for a POC cluster? The downside is that they don't have any ARM instances, but some back-of-the-napkin math shows that swapping 6x im4gn.2xlarge in AWS for 6x n2-standard-8 with 6x 3750 GB pd-balanced SSDs comes out at roughly the same cost and disk perf, and it could be a bit cheaper with their AMD instances instead of Intel. Compared to a gp2 disk it's roughly 4x faster, but about the same perf as the local disk on im4gn.2xlarge.

I have also had decent success getting committed use and reservations on GCP SSDs compared to other cloud providers.

It would be for GCP-based customers, but we are a telemetry platform, and making our (majority) AWS-based customers pay 0.08 per GB to egress to us is a non-starter for them :/
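
To put the egress figure in perspective, here is a quick sketch. The 0.08 per GB rate is from the comment above; the monthly volumes are hypothetical examples, not real customer numbers, and the currency is assumed to be USD.

```python
# Rough monthly egress bill for a customer shipping telemetry out of AWS.
# The 0.08 per GB rate is from the comment; the volumes are hypothetical.
EGRESS_RATE_PER_GB = 0.08  # assumed to be USD

def monthly_egress_cost(gb_per_month: float,
                        rate_per_gb: float = EGRESS_RATE_PER_GB) -> float:
    """Cost of sending gb_per_month of data across the cloud boundary."""
    return gb_per_month * rate_per_gb

for gb in (1_000, 10_000, 100_000):  # hypothetical volumes: 1 TB, 10 TB, 100 TB
    print(f"{gb:>7} GB/month -> {monthly_egress_cost(gb):,.2f}")
```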
Did you look at any of the other solutions such as fq? 300 bytes is a solidly small size. I’m guessing Kafka has gotten faster since this doc was published, but might be worth investigating. https://github.com/neophenix/StateOfTheMQ/blob/master/state_...
Ah, https://github.com/circonus-labs/fq

It was less mature in 2016 when we made the original technology choice (and is still, I'd say, probably not a Boring Technology today). With batching, Kafka is plenty fast for us!