by hodgesrm 1469 days ago
Hi,

You don't mention how much data you have, what the arrival rate is, and how long you would keep it.

You did mention you are familiar with ClickHouse. For datasets under a few billion rows you don't need any special materialized views for specific dimensions. Put everything in a single table with columns for all variants and an identifying column for the type of data. Just make good choices on datatypes, compression, and primary key/sort order. You can then simply apply more compute to get good results.
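
To make the single-table idea concrete, here is a rough sketch that assembles a ClickHouse `CREATE TABLE` statement from a column list. The table name, column names, types, codecs, and sort key are all made up for illustration; the right choices depend on your data and query patterns.

```python
# Hypothetical schema: every name, type, and codec below is an illustrative
# assumption, not something taken from the thread.
COLUMNS = [
    # (name, type, codec or None)
    ("event_type", "LowCardinality(String)", None),  # identifies the kind of row
    ("ts", "DateTime", "Delta, ZSTD"),               # timestamps compress well as deltas
    ("user_id", "UInt64", "ZSTD"),
    ("country", "LowCardinality(String)", None),     # few distinct values
    ("revenue", "Nullable(Float64)", "ZSTD"),        # only set for some event types
]


def build_ddl(table, columns, order_by):
    """Return a MergeTree CREATE TABLE statement for one wide table."""
    col_lines = []
    for name, ctype, codec in columns:
        line = f"    {name} {ctype}"
        if codec:
            line += f" CODEC({codec})"
        col_lines.append(line)
    return (
        f"CREATE TABLE {table}\n(\n"
        + ",\n".join(col_lines)
        + "\n)\nENGINE = MergeTree\n"
        + "PARTITION BY toYYYYMM(ts)\n"
        + f"ORDER BY ({', '.join(order_by)})"
    )


# Sort by the type column first so queries on one kind of data read fewer rows.
ddl = build_ddl("events", COLUMNS, ["event_type", "ts"])
print(ddl)
```

Putting the identifying column first in the sort order is one reasonable choice here, not the only one; if most queries filter on time across all types, a different key may work better.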

ClickHouse can handle 1K columns without much trouble.

edit: clarify use of a single table.

1 comments

Yes, we've run some benchmarks using ClickHouse with the design you just mentioned (a single table that contains all relevant dimensions and metrics).

In our benchmark, we aggregated around 1 billion rows of raw data (2 months of data). An exact distinct count took around 50-60 seconds. With HLL, the query finished in around 20-30 seconds.
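
For reference, the two query shapes being compared probably look something like the sketch below. It assumes the ClickHouse aggregate functions `uniqExact` (exact) and `uniqHLL12` (HyperLogLog-based); the thread does not say which functions were actually used, and the table and column names are hypothetical.

```python
def distinct_count_query(table, column, approximate):
    """Build a distinct-count query over the last two months.

    Assumption: uniqExact / uniqHLL12 stand in for the "exact distinct"
    and "HLL" variants mentioned above. Table, column, and the ts filter
    are illustrative only.
    """
    func = "uniqHLL12" if approximate else "uniqExact"
    return (
        f"SELECT {func}({column}) "
        f"FROM {table} "
        f"WHERE ts >= now() - INTERVAL 2 MONTH"
    )


exact_sql = distinct_count_query("events", "user_id", approximate=False)
hll_sql = distinct_count_query("events", "user_id", approximate=True)
print(exact_sql)
print(hll_sql)
```

The approximate variant trades a small counting error for less memory and time, which matches the direction of the speedup reported above.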

For retention, we're planning to keep 1 year of data, so around 6 billion rows.
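
The 6 billion figure follows from the benchmark volume. A quick back-of-the-envelope sketch, which also extrapolates the query times to a full-year scan under the assumption that time grows linearly with row count (a rough guess, not a measurement):

```python
ROWS_IN_BENCHMARK = 1_000_000_000  # about 1 billion rows
BENCHMARK_MONTHS = 2               # covering 2 months of data
RETENTION_MONTHS = 12              # planned retention: 1 year

rows_per_month = ROWS_IN_BENCHMARK // BENCHMARK_MONTHS
retained_rows = rows_per_month * RETENTION_MONTHS

# Assumption: scan time scales linearly with rows, same hardware and query.
scale = retained_rows / ROWS_IN_BENCHMARK


def scaled(seconds_range, factor):
    """Scale a (low, high) timing range by a constant factor."""
    low, high = seconds_range
    return (low * factor, high * factor)


exact_full_year = scaled((50, 60), scale)  # exact distinct count
hll_full_year = scaled((20, 30), scale)    # HLL-based count

print(retained_rows)
print(exact_full_year, hll_full_year)
```

Queries that filter on a time range or on the leading sort-key columns would touch far fewer rows than a full-year scan, so these numbers are a pessimistic upper bound at best.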