| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jpgvm 2289 days ago

Clickhouse is much more resource efficient in many cases but is less flexible and importantly extensible than Druid.

Druid can easily be extended through available 3rd party extensions and you can write your own to implement custom serialisation formats, aggregations, connect to new streaming systems, read directly from whatever cold storage you have etc.

In the Clickhouse model you have to work out a lot more of that stuff yourself though these days it can read from Kafka directly which is useful.

Some things that are important for Clickhouse vs Druid at big scale is the rather large difference in indexing approaches. Clickhouse uses bloom filters and other probabilistic data structures to index large chunks of data, for the most part though actually checking for rows requires a full scan of that chunk to strip false positives.

This is different to Druid which uses full inverted indices for dimension filtering.

The tradeoff is basically Clickhouse is cheaper, especially when scaling out but Druid is faster especially when the cluster is under heavy concurrent query load, like serving analytics dashboards or data exploration interfaces to users.

Clickhouse excels when you want to scan most but not all the data most of the time. Namely reporting or bulk analytics queries that will hit most rows in a block.

I consider both to be excellent databases.

3 comments

gianm 2289 days ago

Druid committer here. (Also, I think we've met before in SF!)

One thing I wanted to add with regard to performance. Druid does indeed get a big boost from the fact that it uses inverted indexes for filtering. It also gets a boost from having a wide variety of approximate algorithms you can use if you want (for things like topN, count distinct, set difference/intersection, quantiles, etc). But straight scan performance has been improving quite a bit recently too.

The biggest change related to straight scan perf is fully vectorizing the query engine, which is partially done as of the latest release (0.17): https://druid.apache.org/docs/latest/querying/query-context..... In benchmarks, the implementation so far has been posting row scan rate improvements in the 2-3x range. I expect we'll be able to round it out and have it work for all queries over the next couple of releases. The multiples involved mean this is quite meaningful if you do a lot of straight scans.

There's plenty of other stuff going on too: our latest release added parallel merging of large result sets. Our next one (0.18) is going to add a new, more efficient hash aggregation engine. That next release is also going to add a JOIN operator -- not perf related, but probably the number one most requested feature.

link

barrkel 2288 days ago

I was reading the details on inverted index usage in Druid, but what is described seems to be bitmap indexes, not inverted indexes.

Inverted indexes map distinct values in a column to a list of document ids containing the value. Bitmap indexes map distinct values to an array of booleans the same length as the number of documents, with true for presence and false for absence. Both index types can be highly compressed, of course.

Can you clarify what Druid is using?

link

gianm 2285 days ago

Logically, an array of booleans and a set of integers are equivalent. So in the Druid developer community we usually use the terms interchangeably. But to be precise, our indexes are all stored as bitmaps and compressed with bitmap compression libraries.

link

jpgvm 2288 days ago

Good point regarding aggregations, especially if done at ingest time w/rollup they really make a big difference. Clickhouse has aggregations too but they are done at merge time in the background.

Vectorized query engine and JOINs sounds awesome.

(We did meet in SF! Beer hall!)

link

shaklee3 2289 days ago

I think what you're getting at can be accomplished with materialized views in clickhouse now. Most queries that might be fast with inverted indices can be solved that way.

Also, I don't think they use bloom filters for the index as far as I can tell from the documentation. There is certainly an option to use a bloom filter aggregator on a table for faster counts, but it's not the default. If you're referring to the fact that count () is not precise, there's a exact count function too. This is my speculation, though, and you may be fight.

link

jpgvm 2289 days ago

Yeah bloom filters aren't used by default my bad, by default you get no indices at all! I was thinking about the tokenbf/ngrambf indices. These help a ton in improving more sparse queries.

I will need to check out the materialised views. :)

link

peferron 2289 days ago

> I think what you're getting at can be accomplished with materialized views in clickhouse now. Most queries that might be fast with inverted indices can be solved that way.

Let's say you have data with a few dozen dimensions, and want to compute aggregations filtered by any user-supplied union or intersection of dimension values. This is a fairly common use case in analytics dashboards. How do materialized views help with that?

link

danielbln 2289 days ago

In our experience, on a significantly smaller scale, Clickhouse is vastly easier to operate compared to Druid, with all of its various components that all have various knobs and dials to configure and have to be orchestrated.

link

gianm 2289 days ago

Druid committer here. Fwiw, Druid was designed to run on huge clusters and that really shows up in the multi-process architecture. The idea is that if you separate the components needed for ingestion, historical processing, query routing, and coordination, then there are two benefits: they don't interfere with each other (spikes in ingestion load won't interfere with ability to query historical data), and also you can scale each one individually for your workload. You could even auto-scale some of them. For example, the original Druid cluster was operated with load-based auto-scaling for the ingestion processes.

That being said we are currently working on reducing the number of processes to 4 (from the current 6) for a "standard" setup. The main reason is that at smaller scale there isn't as much of a purpose to having a larger number of processes.

We're also working on removing some of the knobs. Actually, depending on what version you originally looked at, many of them might already be gone.

link

danielbln 2289 days ago

It's been a few years since we evaluated Druid. It's great to hear that you're simplifying things, especially for smaller setups!

link

jpgvm 2289 days ago

Yes this is definitely also true.

Druid complexity is coming down a bit compared to where it started. These days you need brokers, middlemanagers and historicals - for queries, ingestion and storage respectively.

In the past to do batch ingestion it also required Hadoop but there is now a native parallel batch ingestion system that runs on the middlemanagers as worker tasks that can read from S3/GCS/existing Druid segments.

Druid is by far the more complex but you get a lot for it and with k8s it's not as hard to run/manage as it was in the past.

link