
by arynda 1340 days ago

For comparison, the same query on ClickHouse [1] also runs in about 30-40ms, however there's no indexing being used and this is a full-table scan.

    create table if not exists test_table
    (
        id UInt64,
        text1 String,
        text2 String,
        int1000 UInt64,
        int100 UInt64,
        int10 UInt64,
        int10_2 UInt64
    )
    engine = MergeTree()
    order by (id)
    ;
    
    insert into test_table
    with
      repeat('b', 1024) as one_kib,
      repeat('b', 255) as bytes_255
    
      select
        number as id,
        one_kib,
        bytes_255,
        rand() % 1000 as int1000,
        rand() % 100 as int100,
        rand() % 10 as int10,
        rand() % 10 as int10_2
      from numbers(10e6)
    ;
  
  
  > select count(*) from test_table where int1000 = 1 and int100 = 1;
  
  ┌─count()─┐
  │    9949 │
  └─────────┘
  
  1 row in set. Elapsed: 0.034 sec. Processed 10.00 million rows, 160.00 MB (290.93 million rows/s., 4.65 GB/s.)
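
A rough sanity check of those figures, as a sketch rather than anything from the original comment. It assumes the count query reads only the two UInt64 filter columns, which is what the reported 160.00 MB suggests. The note on the hit count is an inference from the reported numbers: roughly rows/1000 matches, which would be consistent with int1000 and int100 being derived from the same rand() value per row.

```python
# Back-of-the-envelope check of the figures reported above.
ROWS = 10_000_000        # numbers(10e6)
UINT64_BYTES = 8         # width of one UInt64 value
FILTER_COLUMNS = 2       # assumption: only int1000 and int100 are read

bytes_processed = ROWS * FILTER_COLUMNS * UINT64_BYTES
print(bytes_processed / 1e6)   # megabytes touched by the query: 160.0

# Expected hit counts under two hypotheses about how the columns were filled.
expected_if_independent = ROWS // (1000 * 100)  # two unrelated random values
expected_if_same_value = ROWS // 1000           # one random value x reused

# If x % 1000 == 1 then x = 1000*k + 1, so x % 100 == 1 as well;
# the second filter condition then removes nothing.
implied = all(x % 100 == 1 for x in range(1, 1_000_000, 1000))

print(expected_if_independent, expected_if_same_value, implied)
```

The reported 9949 sits close to the second figure, and the 1B-row run (999831) scales the same way.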

The same table with 1B rows instead runs in ~1800ms:

  > select count(*) from test_table where int1000 = 1 and int100 = 1;

  ┌─count()─┐
  │  999831 │
  └─────────┘

  1 row in set. Elapsed: 1.804 sec. Processed 1.00 billion rows, 16.00 GB (554.24 million rows/s., 8.87 GB/s.)

[1] Converted the table create and insert logic from here: https://github.com/sirupsen/napkin-math/blob/master/newslett...


hodgesrm 1340 days ago

> however there's no indexing being used and this is a full-table scan.

That first statement about "no indexing being used" is not quite correct if the query is run exactly as you show in your nice example.

ClickHouse performs what is known as PREWHERE processing, which will effectively use the int1000 and int100 columns as indexes. It scans those columns and knocks out any blocks (technically granules, containing 8192 rows by default) that do not contain values matching the filter conditions. It then performs a scan on the remaining blocks to get the actual counts.

PREWHERE is effective because columns are compressed and scans are fast. If there's any pattern to the filter columns (for example, monotonically increasing counters) or their values have high cardinality, PREWHERE processing will remove a large number of blocks. This will make the rest of the scan far faster.

In your dataset it may not be especially efficient because you use random values, which don't necessarily compress well, and the values will appear in many blocks. It works much better in real datasets where data are more correlated.

EDIT: PREWHERE is much faster in cases where you are doing more complex aggregation on many columns. Counts of course don't need to scan any extra values so it's not helpful in this case.
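
A toy model of the two-phase read described above, to make the idea concrete. This is a minimal sketch and not ClickHouse's implementation: the function name `prewhere_scan`, the in-memory lists standing in for columns, and the sample data are all hypothetical. Phase 1 scans only the filter column granule by granule; phase 2 touches the other ("payload") column only for granules that had at least one match.

```python
GRANULE_ROWS = 8192  # default granule size mentioned above

def prewhere_scan(filter_col, payload_col, wanted, granule_rows=GRANULE_ROWS):
    """Return (matching payload values, number of payload granules read)."""
    matches = []
    payload_granules_read = 0
    for start in range(0, len(filter_col), granule_rows):
        # Phase 1: look at the cheap filter column only.
        granule = filter_col[start:start + granule_rows]
        hits = [start + i for i, value in enumerate(granule) if value == wanted]
        if not hits:
            continue  # granule knocked out, payload column never touched
        # Phase 2: read the payload column for this surviving granule.
        payload_granules_read += 1
        matches.extend(payload_col[pos] for pos in hits)
    return matches, payload_granules_read

rows = 100_000
payload = [f"row-{i}" for i in range(rows)]
clustered = [i // 1000 for i in range(rows)]  # slowly increasing counter
scattered = [i % 100 for i in range(rows)]    # every value in every granule

clustered_hits, clustered_reads = prewhere_scan(clustered, payload, 42)
scattered_hits, scattered_reads = prewhere_scan(scattered, payload, 42)
print(len(clustered_hits), clustered_reads)  # 1000 1
print(len(scattered_hits), scattered_reads)  # 1000 13
```

Both runs find the same number of rows, but the clustered column lets the scan skip all but one of the 13 granules, while the scattered column (standing in for the random test data) forces every granule to be read.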

p.s. Scans are ridiculously fast.


paulmd 1340 days ago

> p.s. Scans are ridiculously fast.

this is really the lesson of SOLR. full-scan all the things, aggregate as you go, broadcast disk IO to multiple listeners.

why do a bunch of 4K random IO when you could full-scan at bus speed? yeah you can make the 4K random IO super fast but that's not where hardware is going, and it's also scalable/clusterable/shardable where RDBMS caps out at one machine and clustering is kinda ugly.


yxhuvud 1340 days ago

Huh? That is exactly where hardware is going. What has been missing is the parts in between, the ability to emit enough random IO in parallel to saturate the interfaces.


paulmd 1340 days ago

"huh? high clockrates is exactly where hardware is going, we just haven't figured out how to get the silicon to work"

no, that's the opposite of where hardware is going. when was the last time flash latency or DRAM latency significantly improved? literally 15 years ago. Optane is the only improvement that has been made on that front and optane is effectively dead at this point. so actually latency and random IO is going backwards right now.

hardware is going in the direction of sustained block transfers - that is where NVMe is still improving today.

issuing random requests still basically sucks just as much as it did 15 years ago, and doing a lot of them in parallel is just a shitty bandaid patch on the problem. some workloads are irreducibly single-threaded, you simply can't proceed on the logic until you know the last bit of data. being able to do lots of those in parallel is nice, but it's the consolation prize on latency no longer scaling anymore.

so, stop doing random IO and just stream your workset in large blocks and have listeners pick off their bit (retrieve individual records, or perform their aggregations) as it streams by. Effectively, do your disk IO in big sustained transfers and then do the random IO in memory.
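
A minimal sketch of that "stream big blocks, let listeners pick off their bit" pattern, kept in memory for illustration. The class and function names (`CountWhere`, `FetchByKey`, `shared_scan`) and the block size are hypothetical, and a real system would be streaming blocks from disk rather than slicing a list.

```python
class CountWhere:
    """Listener that aggregates as blocks stream by."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.count = 0

    def consume(self, block):
        self.count += sum(1 for record in block if self.predicate(record))

class FetchByKey:
    """Listener that retrieves individual records by key."""
    def __init__(self, wanted_keys):
        self.wanted = set(wanted_keys)
        self.found = {}

    def consume(self, block):
        for key, value in block:
            if key in self.wanted:
                self.found[key] = value

def shared_scan(records, listeners, block_size=4096):
    """One sequential pass over the data, broadcast to every listener."""
    blocks_read = 0
    for start in range(0, len(records), block_size):
        block = records[start:start + block_size]  # one large sequential read
        blocks_read += 1
        for listener in listeners:
            listener.consume(block)  # the "random" work happens in memory
    return blocks_read

records = [(i, i % 10) for i in range(10_000)]
counter = CountWhere(lambda record: record[1] == 3)
lookup = FetchByKey({7, 9_000})
blocks_read = shared_scan(records, [counter, lookup])
print(blocks_read, counter.count, lookup.found)
```

Adding another listener costs extra CPU per block but no extra reads, which is the trade being argued for here.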

Not that memory has improved over the last 15 years either but it's better than doing it on disk.

SOLR is behind most of the big webscale search and commerce systems nowadays. Nobody is doing Amazon on RDBMS-style random IO systems, not even on database-style document systems like mongodb (which is really more pseudo-RDBMS than a true document search system).

Or on the flipside: if you want to do random IO on your SSDs, do random IO on your SSDs - forget the whole filesystem layer, and use a key-value SSD (which do exist). That's what your database provides right now, after all. But RDBMS (or, again, quasi-RDBMS random-IO document stores like Mongo) doesn't play to the strength of SSDs anymore, that's not where they're getting better, and if you want to treat it as a block store then you might as well stream big blocks and not do random IO on your device layer.

https://www.mydistributed.systems/2020/07/towards-building-h...


arynda 1340 days ago

> ClickHouse performs what is known as PREWHERE processing

> p.s. Scans are ridiculously fast.

Good point, I should have mentioned this was basically a worst-case scenario for ClickHouse, as the data layout is totally random (same approach as OP used in their benchmark) and isn't able to utilize any granule pruning, sorting, or skip indexing, but it is still able to achieve such remarkable speeds.


hodgesrm 1340 days ago

What's cool is that even in this case ClickHouse is still stupid fast compared to most other databases. ;)


stingraycharles 1340 days ago

Out of curiosity:

> It scans those columns and knocks out any blocks (technically granules, containing 8192 rows by default) that do not contain values matching the filter conditions

How is that not just a sequential scan? Of course it pre-emptively filters away entire blocks that do not contain the data, but indexes typically work differently: they’re calculated upon write, so that they can be queried really fast.

Is there a detail missing here, e.g. bloom filters being used or something else that makes it different from a regular sequential scan?


Nican 1340 days ago

I know nothing about ClickHouse, but how many cores are being used for these queries? And what is the core processing frequency?

I would think ClickHouse is tuned for analytics workloads, so it will throw plenty of cores at the problem, and not care much for the overhead. Meanwhile, I believe PostgreSQL is more tuned to transactional workloads, where it will not pay the query parallelism overhead, but optimize for multiple parallel workloads.


Sirupsen 1340 days ago

Are you aware of a good write-up on how ClickHouse/other columnar databases do the intersection?


hodgesrm 1340 days ago

ClickHouse uses a single primary key index, which matches the sort order, plus skip indexes, which knock out blocks to scan. Here's a writeup that explains skip indexes.

https://altinity.com/blog/clickhouse-black-magic-skipping-in...
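
To make the two index kinds concrete, here is a small sketch built at "write time" over in-memory lists. It is an illustration under assumptions, not ClickHouse internals: the function names are hypothetical, the sparse primary index is modelled as the first key of each granule, and the skip index is modelled as a per-granule (min, max) pair.

```python
from bisect import bisect_left, bisect_right

GRANULE_ROWS = 8192  # default granule size

def build_sparse_primary_index(sorted_keys, size=GRANULE_ROWS):
    """First key of every granule; keys must already be in sort order."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), size)]

def granules_for_key(marks, key):
    """Granule numbers that could contain `key` (conservative with duplicates)."""
    first = max(bisect_left(marks, key) - 1, 0)
    last = bisect_right(marks, key)
    return range(first, last)

def build_minmax_index(column, size=GRANULE_ROWS):
    """Per-granule (min, max), computed once when the data is written."""
    index = []
    for start in range(0, len(column), size):
        granule = column[start:start + size]
        index.append((min(granule), max(granule)))
    return index

def granules_to_scan(minmax, wanted):
    """Granules whose value range could contain `wanted`."""
    return [g for g, (lo, hi) in enumerate(minmax) if lo <= wanted <= hi]

rows = 100_000
ids = list(range(rows))                        # the sort key
correlated = [1_000 + i // 10 for i in ids]    # grows along with the sort key
scattered = [i % 100 for i in ids]             # unrelated to the sort key

marks = build_sparse_primary_index(ids)
print(list(granules_for_key(marks, 10_000)))   # [1]

print(granules_to_scan(build_minmax_index(correlated), 5_000))      # [4]
print(len(granules_to_scan(build_minmax_index(scattered), 42)))     # 13
```

The last line shows why skip indexes do little for columns whose values appear in every granule: all 13 granules still have to be scanned.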

You can also check out the following webinar, which explains how ClickHouse indexes work in general. Here's a link to the discussion of indexes.

https://youtu.be/1TGGCIr6dMY?t=1933

p.s. The blog article is missing some images that WordPress seems to have lost but you'll still get the idea. (Should be fixed shortly.)

Disclaimer: I work for Altinity


arynda 1340 days ago

Not in particular, sorry. Most of the good content I've found is on the Altinity [1] and Alibaba [2][3] technical blogs. These tend to focus on how the data itself is stored and how to use ClickHouse, but don't really dive into the specifics of how query processing is performed.

[1] https://altinity.com/blog/

[2] https://www.alibabacloud.com/blog/clickhouse-kernel-analysis...

[3] https://www.alibabacloud.com/blog/clickhouse-analysis-of-the...


twoodfin 1340 days ago

One obvious way is to build a bitmap indexed by row position for each filter. Both the "&" intersect and the final bit count can be rocket fast on modern CPU vector units.
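
A minimal sketch of that bitmap approach, using Python's arbitrary-precision integers as bitsets (bit i set means row i passes the filter). The helper names and sample columns are hypothetical; a real engine would use fixed-width words and vector instructions rather than big integers.

```python
def build_bitmap(column, wanted):
    """Bitmap indexed by row position: bit i is 1 when column[i] == wanted."""
    bits = 0
    for pos, value in enumerate(column):
        if value == wanted:
            bits |= 1 << pos
    return bits

def popcount(bits):
    """Number of set bits, i.e. number of matching rows."""
    return bin(bits).count("1")

rows = 10_000
col_a = [i % 10 for i in range(rows)]
col_b = [i % 4 for i in range(rows)]

bitmap_a = build_bitmap(col_a, 1)
bitmap_b = build_bitmap(col_b, 1)

both = bitmap_a & bitmap_b      # the "&" intersect
matching_rows = popcount(both)  # the final bit count

# Cross-check against a plain row-by-row count.
naive = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
print(matching_rows, naive)     # 500 500
```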
