|
Comparison on Clickhouse, also runs in about 30-40ms, however there's no indexing being used and this is a full-table scan. create table if not exists test_table
(
id UInt64,
text1 String,
text2 String,
int1000 UInt64,
int100 UInt64,
int10 UInt64,
int10_2 UInt64
)
engine = MergeTree()
order by (id)
;
insert into test_table
with
repeat('b', 1024) as one_kib,
repeat('b', 255) as bytes_255
select
number as id,
one_kib,
bytes_255,
rand() % 1000 as int1000,
rand() % 100 as int100,
rand() % 10 as int10,
rand() % 10 as int10_2
from numbers(10e6)
;
> select count(*) from test_table where int1000 = 1 and int100 = 1;
┌─count()─┐
│ 9949 │
└─────────┘
1 row in set. Elapsed: 0.034 sec. Processed 10.00 million rows, 160.00 MB (290.93 million rows/s., 4.65 GB/s.)
The same table but with 1B rows instead, runs in ~1800ms > select count(*) from test_table where int1000 = 1 and int100 = 1;
┌─count()─┐
│ 999831 │
└─────────┘
1 row in set. Elapsed: 1.804 sec. Processed 1.00 billion rows, 16.00 GB (554.24 million rows/s., 8.87 GB/s.)
[1] Converted the table create and insert logic from here: https://github.com/sirupsen/napkin-math/blob/master/newslett... |
That first steatement about "no indexing being used" is not quite correct if the query is run exactly as you show in your nice example.
ClickHouse performs what is known as PREWHERE processing which will effectively use the int1000 and int100 columns as indexes. It scans those columns and knocks out any blocks (technically granules containing by default 8192 rows) that do not values that match the filter conditions. It then performs a scan on the remaining blocks to get the actual counts.
PREWHERE is effective because columns are compressed and scans are fast. If there's any pattern to the filter columns (for example monotonically increasing counters) or their values have high cardinality PREWHERE processing will remove a large number of blocks. This will make the rest of the scan far faster.
In your dataset it may not be especially efficient because you use random values, which don't necessarily compress well, and the values will appear in many blocks. It works much better in real datasets where data are more correlated.
EDIT: PREWHERE is much faster in cases where you are doing more complex aggregation on many columns. Counts of course don't need to scan any extra values so it's not helpful in this case.
p.s. Scans are ridiculously fast.