by gianm 2289 days ago
Druid committer here. (Also, I think we've met before in SF!)

One thing I wanted to add with regard to performance: Druid does indeed get a big boost from the fact that it uses inverted indexes for filtering. It also gets a boost from having a wide variety of approximate algorithms you can use if you want (for things like topN, count distinct, set difference/intersection, quantiles, etc.). But straight scan performance has been improving quite a bit recently too.

The biggest change related to straight scan perf is fully vectorizing the query engine, which is partially done as of the latest release (0.17): https://druid.apache.org/docs/latest/querying/query-context..... In benchmarks, the implementation so far has been posting row scan rate improvements in the 2-3x range. I expect we'll be able to round it out and have it work for all queries over the next couple of releases. The multiples involved mean this is quite meaningful if you do a lot of straight scans.
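
For readers unfamiliar with the term, here is a minimal sketch of the structural difference between row-at-a-time and vectorized (batch-at-a-time) processing. It is not Druid's code: the function names, the filter-then-sum query, and the batch size of 512 are all assumptions made for illustration, and plain Python will not reproduce the speedups mentioned above. The point is only that the batch version makes one predicate call per batch instead of one per row.

```python
def sum_row_at_a_time(column, predicate):
    """Row-at-a-time: one predicate call and one accumulate step per row."""
    total = 0
    for value in column:
        if predicate(value):
            total += value
    return total


def sum_vectorized(column, batch_predicate, batch_size=512):
    """Batch-at-a-time: each predicate call handles a whole slice of the column.

    batch_size=512 is an arbitrary illustrative value.
    """
    total = 0
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        mask = batch_predicate(batch)  # one call per batch, returns a list of booleans
        total += sum(v for v, keep in zip(batch, mask) if keep)
    return total


column = list(range(10_000))
row_result = sum_row_at_a_time(column, lambda v: v % 2 == 0)
batch_result = sum_vectorized(column, lambda batch: [v % 2 == 0 for v in batch])
```

Both functions compute the same answer; in an engine written in a compiled language, the batch form tends to help because the per-call overhead is amortized over many rows and the inner loops are simpler to optimize.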

There's plenty of other stuff going on too: our latest release added parallel merging of large result sets. Our next one (0.18) is going to add a new, more efficient hash aggregation engine. That next release is also going to add a JOIN operator -- not perf related, but probably the number one most requested feature.


I was reading the details on inverted index usage in Druid, but what is described seems to be bitmap indexes, not inverted indexes.

Inverted indexes map distinct values in a column to a list of document ids containing the value. Bitmap indexes map distinct values to an array of booleans the same length as the number of documents, with true for presence and false for absence. Both index types can be highly compressed, of course.
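
To make the distinction concrete, here is a small sketch of both structures built over the same column. The function names and the sample column are hypothetical, and neither index is compressed here, so this only illustrates the logical layout described above.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the sorted list of row ids that contain it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def build_bitmap_index(column):
    """Map each distinct value to a list of booleans, one entry per row."""
    n = len(column)
    index = {}
    for row_id, value in enumerate(column):
        if value not in index:
            index[value] = [False] * n
        index[value][row_id] = True
    return index


def bitmap_to_row_ids(bitmap):
    """Recover the row-id list encoded by a bitmap."""
    return [row_id for row_id, present in enumerate(bitmap) if present]


column = ["US", "DE", "US", "FR", "DE", "US"]
inverted = build_inverted_index(column)
bitmaps = build_bitmap_index(column)

# Converting every bitmap back to row ids yields the inverted index.
recovered = {value: bitmap_to_row_ids(bits) for value, bits in bitmaps.items()}
```

Here `recovered` equals `inverted`, which shows the two forms carry the same information; they differ in storage layout and in which operations are cheap.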

Can you clarify what Druid is using?

Logically, an array of booleans and a set of integers are equivalent. So in the Druid developer community we usually use the terms interchangeably. But to be precise, our indexes are all stored as bitmaps and compressed with bitmap compression libraries.

Good point regarding aggregations; especially when done at ingest time with rollup, they really make a big difference. ClickHouse has aggregations too, but they are done at merge time in the background.
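
As a rough illustration of what rollup at ingest time means, the sketch below pre-aggregates raw events by time bucket and dimension value before they are stored. The event layout `(timestamp, country, bytes)`, the hourly granularity, and the `count`/`bytes` metrics are assumptions chosen for the example, not Druid's actual ingestion format.

```python
from collections import defaultdict


def rollup(events, granularity_seconds=3600):
    """Pre-aggregate raw events by (time bucket, dimension value).

    Each event is a hypothetical (unix_timestamp, country, bytes) tuple.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, country, nbytes in events:
        bucket = ts - ts % granularity_seconds  # truncate to the start of the bucket
        row = totals[(bucket, country)]
        row["count"] += 1
        row["bytes"] += nbytes
    return dict(totals)


events = [
    (7200, "US", 100),
    (7260, "US", 50),
    (7300, "DE", 10),
    (10800, "US", 5),
]
rolled = rollup(events)
```

Four raw events collapse into three stored rows here; queries then scan the smaller rolled-up table, at the cost of losing the individual events.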

Vectorized query engine and JOINs sound awesome.

(We did meet in SF! Beer hall!)