by iknownothow 1168 days ago
The total dataset size doesn't seem like much at ~1TB, so you can use ClickHouse for fast analytics like others have suggested. Consider hosting on Hetzner since they have cheap NVMe disks.

The biggest problem you're going to face is ingesting these events during peaks of 500k events per minute (roughly 8,300 per second). You can't ingest them individually into ClickHouse or most other databases. So unfortunately you will have to add a streaming layer in front to buffer these events, so that once every few seconds you can ingest one big batch of 1k-10k events into ClickHouse. AWS API Gateway + Kinesis is operationally easy to set up, quite cheap, and should be able to handle your peak load. Afterwards, use a Lambda to batch >1000 events from Kinesis and insert them into ClickHouse. I've never tested this last part, so I'm not sure how it will work out.
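To make the batching idea concrete, here is a minimal sketch in Python of a buffer that flushes on either a size or an age threshold. The `EventBatcher` name, the thresholds, and the `sink` callable are all my own assumptions for illustration; in a real setup the sink would be whatever does the bulk insert into ClickHouse, and I haven't run this against Kinesis or Lambda.

```python
import time
from typing import Any, Callable, List, Optional


class EventBatcher:
    """Hypothetical helper: buffers events and hands them to `sink` in batches.

    A batch is flushed when it reaches `max_events`, or when the oldest
    buffered event is at least `max_age_s` seconds old. The age check only
    runs when add() is called, so a real service would also call flush()
    from a timer or at shutdown.
    """

    def __init__(
        self,
        sink: Callable[[List[Any]], None],
        max_events: int = 5000,
        max_age_s: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._clock = clock
        self._buf: List[Any] = []
        self._first_ts: Optional[float] = None

    def add(self, event: Any) -> None:
        # Remember when the current batch started so we can bound its age.
        if not self._buf:
            self._first_ts = self._clock()
        self._buf.append(event)
        too_big = len(self._buf) >= self._max_events
        too_old = self._clock() - self._first_ts >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Nothing to do for an empty buffer.
        if not self._buf:
            return
        batch, self._buf = self._buf, []
        self._first_ts = None
        self._sink(batch)
```

With something like `EventBatcher(sink=insert_batch, max_events=5000, max_age_s=2.0)`, where `insert_batch` is a placeholder for your own bulk-insert function, a peak of ~8,300 events per second would produce one or two inserts per second instead of thousands.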

It'd be nice to know what you eventually go with. If you can, please send me a message about what you finally chose.

1 comments

> ingestion of these events during peaks at 500k events per minute. You can't ingest them individually into Clickhouse or most other databases.

Turn on async_insert or use a Buffer table engine and you can easily insert them individually into ClickHouse.
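As a rough sketch of the async_insert route: ClickHouse's HTTP interface accepts settings as URL query parameters, so a single-row insert could be built like this. The `build_async_insert_request` helper, the `events` table, and the localhost URL are assumptions for illustration, not anything from the original setup. `wait_for_async_insert=1` asks the server to acknowledge only after the buffered data has been flushed.

```python
import json
from typing import Any, Dict, Tuple
from urllib.parse import urlencode


def build_async_insert_request(
    base_url: str, table: str, row: Dict[str, Any]
) -> Tuple[str, bytes]:
    """Hypothetical helper: build the URL and body for one single-row insert.

    `table` is assumed to be a trusted identifier, not user input.
    """
    params = {
        # The INSERT statement goes in the query string; the row goes in the body.
        "query": f"INSERT INTO {table} FORMAT JSONEachRow",
        # Let the server buffer small inserts and write them as larger parts.
        "async_insert": 1,
        # Acknowledge only once the buffered data has actually been written.
        "wait_for_async_insert": 1,
    }
    url = f"{base_url}/?{urlencode(params)}"
    body = json.dumps(row).encode("utf-8")
    return url, body


url, body = build_async_insert_request(
    "http://localhost:8123", "events", {"id": 1, "name": "click"}
)
```

Sending it is then a plain POST of `body` to `url`, for example with `urllib.request.Request(url, data=body, method="POST")`, which of course needs a running server and an existing table.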

That's interesting! I don't have much experience with ClickHouse, especially not in the last two years. I'll have to try this out myself. That's pretty incredible if it can handle batching internally.