Hacker News new | ask | show | jobs
by barrkel 1546 days ago
String columns are expensive to compare, certainly more expensive than integers.

ClickHouse has a low cardinality option when defining a column and it's particularly helpful for strings, especially those which are logically enums or a denormalized lookup from a small source table.

https://clickhouse.com/docs/en/sql-reference/data-types/lowc...

In essence it stores integers instead of strings (or whatever) so it can use a fast int comparison when you're using equality conditions on the column.

If you want fast scanning you don't want joins. So you want to denormalize attributes that you'd normally reach for across a join. But if you can't denormalize those attributes - e.g. they change (so they'd be prohibitively expensive to update in ClickHouse) - then you can store the primary key in ClickHouse, calculate your conditions separately, then use an IN check.

Let me give an example. Suppose we have a Label table in a relational database and an Event table in ClickHouse and we want to find events which have a particular label name (or owner, or possibly other attribute even more joins away).

Instead of using foreign tables in ClickHouse and joining Event to Label, or importing the Label table, you can evaluate your condition in your relational database to get a list of label ids, and then use an IN (label_id, ...) in the ClickHouse query. That way you avoid the join.