by jack9 3912 days ago
Do you use batching to reach that scale of throughput? Streams are sometimes pre-aggregated data, and it wasn't clear whether you maintained the granularity through the changes.
2 comments

I can't speak for their implementation but batching is not necessary. Stream processing complex JSON documents and storing the documents to disk at rates of 500k documents/second per server is demonstrably achievable on some scale-out systems.

The internal architectures make an enormous difference in throughput. A proper high-performance stream processing engine does not look anything like the "Hadoop in RAM" style model.

> Stream processing complex JSON documents and storing the documents to disk at rates of 500k documents/second per server is demonstrably achievable on some scale-out systems

So is it per server or scaled out? I thought SSDs were capped at around 100k discrete writes per second (P/E, aka write cycles).

Can you give an example? In practice I've been unable to exceed 10k/sec/server using a number of technologies and combinations to collect from a socket, parse JSON, and write to a socket. That's just my specific use case.

Looking at the top end of Intel's SSD lineup, I see that they have a product that advertises up to 175k IOPS of random 4K writes. Is this what you are referring to?

The product is the 2TB P3700.
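
For a rough sense of how an IOPS figure relates to a documents/second figure, here is a back-of-the-envelope sketch. It assumes several documents can share one 4K write, and the 1 KiB document size is purely an assumption for illustration; nothing in this thread states the real document size or how the engine lays out its writes.

```python
def docs_per_second(iops: int, io_bytes: int, doc_bytes: int) -> int:
    """Upper bound on documents/second if documents are packed into writes.

    iops      -- advertised write operations per second
    io_bytes  -- size of each write operation in bytes
    doc_bytes -- assumed average document size in bytes (hypothetical)
    """
    bytes_per_second = iops * io_bytes
    return bytes_per_second // doc_bytes


# 175k random 4K writes/second with an assumed 1 KiB document
packed = docs_per_second(175_000, 4096, 1024)

# Worst case: every document costs one full 4K write
unpacked = docs_per_second(175_000, 4096, 4096)
```

Under those assumptions the ceiling is well above 175k documents/second when documents are packed, and exactly the IOPS figure when each document costs its own write. It is only an upper bound on the storage side and says nothing about parsing cost.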

There's no batching. We have a 1-to-1 mapping of Kafka messages to the measurements we receive from our API, though that could change over time. Superchief just reads the messages, and each message is passed off to another thread for processing.
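
A minimal sketch of that shape (one message in, one unit of work handed to a worker thread), not Superchief's actual code. The `messages` iterable stands in for a Kafka consumer, and `process_measurement` and the `name`/`value` fields are hypothetical names made up for the example.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def process_measurement(raw: bytes) -> dict:
    """Hypothetical handler: parse one message into one measurement."""
    doc = json.loads(raw)
    return {"name": doc["name"], "value": float(doc["value"])}


def consume(messages, workers: int = 4) -> list:
    """Read messages one at a time and hand each to a worker thread.

    `messages` is any iterable of raw JSON payloads; here it is a
    stand-in for a Kafka consumer (an assumption, not a real client).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One submit per message: no batching, 1-to-1 mapping.
        futures = [pool.submit(process_measurement, m) for m in messages]
        # Collect results in the order the messages were read.
        return [f.result() for f in futures]


results = consume([
    b'{"name": "cpu", "value": "1.5"}',
    b'{"name": "mem", "value": "42"}',
])
```

In CPython the GIL limits how much parallel JSON parsing threads actually buy, so this shows the dispatch pattern only, not a way to reach the throughput discussed above.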