by gianm 2272 days ago
Hey Mani. Druid committer here. It actually is a column store! The project makes a big deal about its ability to do indexes and pre-aggregation because those are important capabilities and, while not unique, are also not universally supported by every column store out there. So they are interesting differentiators. But architecturally they are really just extra icing on the cake.

Personally I see stuff like Druid, MemSQL, ClickHouse, Redshift, BigQuery, and Snowflake as technological siblings in the same space. These systems are all evolving rapidly too (well, the healthy ones are anyway) so it's definitely a good time to be an analytical database enthusiast.

With regard to the operational complexity, that's an interesting point. It shows up in two main ways, I think -- the multi-process architecture and the use of external deep storage. On huge clusters, which is what Druid was designed for, the idea is that explicitly separating components in this way gives you three benefits: they don't interfere with each other (spikes in ingestion load won't interfere with your ability to query historical data), you can scale each one individually, and it makes most components "disposable" (as long as your deep storage is reliable, the other Druid components can be blown away and recreated without losing any data). It helps when you're trying to run a big cluster in a stateless / containerized environment.

But these aspects are less good on small clusters or single servers, where it just feels like a bunch of overhead. So we're currently working on simplifying some of this for people that aren't running huge clusters.

We're also expanding SQL support rapidly. Almost every release adds additional SQL capabilities. The next release is a big one, adding JOIN and GROUPING SETS operators. The project's goal is to support it all before too long -- up next after this release will likely be analytic functions.
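For anyone who hasn't used GROUPING SETS: it computes several GROUP BY aggregations in one query. Here's a minimal plain-Python sketch of the semantics only. It is not how Druid implements the operator, and the rows, column names, and the `grouping_sets` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical input rows; the column names are illustrative only.
ROWS = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
]


def grouping_sets(rows, sets, metric):
    """Sum `metric` once per grouping set.

    Columns that are not part of the current grouping set are reported
    as None, mirroring the NULLs a SQL engine would return.
    """
    # Collect every column mentioned in any set, in first-seen order.
    all_cols = []
    for cols in sets:
        for col in cols:
            if col not in all_cols:
                all_cols.append(col)

    results = []
    for cols in sets:
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[c] if c in cols else None for c in all_cols)
            totals[key] += row[metric]
        for key, total in totals.items():
            results.append({**dict(zip(all_cols, key)), metric: total})
    return results


# Roughly: SELECT country, device, SUM(clicks) ...
#          GROUP BY GROUPING SETS ((country), (device), ())
result = grouping_sets(ROWS, [("country",), ("device",), ()], "clicks")
for line in result:
    print(line)
```

The empty set `()` produces the grand total, so a single pass over the query gives you per-country, per-device, and overall sums together.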

If you're interested in checking out the community, we do meetups pretty often (all virtual now, though, due to COVID-19). We're also planning our first user conference later in the year @ https://druidsummit.org/.

Hey Gian, I've been familiar with Druid since its start at Metamarkets (and was a client of that company). I've been following Imply and you guys have done great work at making Druid a lot better over the years.

I guess I should've said "relational columnstore" to describe the others. Vertica has S3/remote storage interfaces similar to Historicals, and all vendors are adding indexing to columnstore segments beyond partition/zone maps for fast seeks. MemSQL is the most advanced, with in-memory tables to augment the disk-based columnstores.

The improved SQL support will help and the overall design of Druid makes sense, but I have to stand by the fact that I find it tough to recommend over the alternatives now. If everything's converging on similar functionality, what would you say is the roadmap for Druid's future advantage?

Those are good questions.

IMO Druid is best differentiated if you want to power an online, real-time, high-concurrency analytical application at scale. It is the use case Druid was originally designed for and still the one where the project shines the brightest. The reason mostly isn't related to things that database people usually talk about (storage format, indexes, etc). That stuff is important but isn't a major differentiator between systems in today's world. The reason is more related to the pieces in between servers, like locking, replication, fault tolerance, data partitioning and balancing, and resource management. Druid's approach to these things is relatively unique and gives it characteristics that allow it to do well at powering these sorts of apps at scale. I think it will remain an important advantage of Druid over other systems. Maybe one day the details will make a good blog post :)

As far as the roadmap goes, most of the work we're doing to make Druid better falls into two categories: first, stuff that makes it even better at this core analytical app engine use case; second, stuff that better supports new use cases, like the work on building out SQL. They are both important so usually each release has a bit of both.