|
|
|
|
|
by barrkel
1547 days ago
|
|
ClickHouse is awesome, but most of the benefits come from columnar storage and you need to design around that. Be aware of how the thing works and how computer architecture works, because sympathy with the machine is what reaps rewards. You want to minimize the number and size of columns touched when filtering and aggregating. If you need source data, store it relationally or in a document store and only select the key from CH. Don't put JSON in CH, fat columns don't make sense. And CH can be just as slow as MySQL if you select a whole wide row but only apply predicates to a handful of columns. Only touch the columns you need. Joins are super expensive because it costs a whole lot of instructions to look up a row on a per value basis. CH can use vectorized operations to eliminate or aggregate multiple "rows" with instruction level parallelism because the column data is contiguous. Joins are going to be an order of magnitude slower, just with memory latency randomly hopping around a hash table. Insert data prejoined; use the low cardinality string column, substitute (i.e. precalculate) conditions on low cardinality relations with integer IN tests, denormalize high cardinality relations. Partitions and various other storage level tricks are a way to eek out better perf for mutations when they're needed. Rebuild a subset of the data and swap it in and out. This is common on Hadoop based columnar stores like Parquet and last I looked at CH it was getting better ways to shuffle partitions around, "attach partition from" and so on. |
|
On JOINs - again, requirements bite us in different ways. Product analytics tool data ingestion pipelines can get quite complicated due to needing to handle merging anonymous and signed in users and user properties changing over time. Handling that via JOINs is as a go-to-market helps avoid that upfront cost by centralising the logic in SQL, but indeed does come with a significant cost in scalability. Delaying in turn allows you to be building tools users need. That said every loan needs to be paid at some point and we're currently knee deep in re-architecting everything to avoid these joins.
Also note that JOINs don't work the way you described from our experience - rather the right hand side of the join gets loaded into memory. The bottleneck there is memory pressure rather than I/O with a good ORDER BY on the table.
All that said, what a great summary of all the different things to keep an eye on. Thanks for reading and sharing your thoughts!