by jacquesnadeau 3156 days ago
We built Dremio (github.com/dremio/dremio-oss, Apache licensed) entirely on top of Apache Arrow, specifically to create high-speed analytical capabilities, including MOLAP-like work as well as other forms of caching/acceleration for analytical workloads. Other products/projects are also starting to adopt a similar technical architecture.

Dremio looks very interesting indeed. What would you recommend for interacting with Arrow with more control, as a library? I'm interested in creating new Arrow-based data sources, not using it as an intermediary to other data sources.

On a side note - what other products/projects did you mean?

The Arrow project itself is a set of libraries. One of the things we'll do is try to add more algorithms to it over time, so that if you want, say, a fast Arrow sort or Arrow predicate application, it will be there. Full SQL is always far more complex, and I can't see the project itself taking that on.
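
To make "predicate application" and "sort" over columnar data concrete, here is a minimal pure-Python sketch. It is not the Arrow API: the `columns` dict, `apply_predicate` and `sort_by` are hypothetical names used only to illustrate the idea of working column-at-a-time.

```python
# Toy columnar table: one list per column (illustration only, not Arrow).
columns = {
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "sales": [30, 10, 25, 40],
}

def apply_predicate(columns, name, pred):
    # Evaluate the predicate on a single column to build a selection mask,
    # then apply that same mask to every column.
    mask = [pred(v) for v in columns[name]]
    return {k: [v for v, keep in zip(col, mask) if keep]
            for k, col in columns.items()}

def sort_by(columns, name):
    # Compute the row order once from the key column,
    # then reorder every column with that order.
    key = columns[name]
    order = sorted(range(len(key)), key=key.__getitem__)
    return {k: [col[i] for i in order] for k, col in columns.items()}

filtered = apply_predicate(columns, "sales", lambda s: s >= 25)
result = sort_by(filtered, "sales")
```

The point of the sketch is that both operations touch only the column they need to decide on, and the other columns are just reordered or masked.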

The engine inside of Dremio is something we call Sabot (a shoe for modern arrows; see "sabot round" on Wikipedia). We hope to make it modular enough one day to use as a library, but it isn't there yet.

Regarding your other question about projects/products: Arrow contributors are actively trying to get more adoption of Arrow as an interchange format for several systems. We've had discussions around Kudu (no serious work done yet afaik). Parquet-to-Arrow for multiple languages is now available. Arrow committers include committers from several other projects such as HBase, Cassandra, Phoenix, etc. The goal is ultimately to figure out integrations with all of them.

In most cases, these data storage systems are saddled with slow interfaces for data access. (Think row-by-row, cell-by-cell interfaces.) Arrow, among other things, allows them to communicate through a much faster mechanism (shared memory--or at least shared representation if not node local).
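
As a rough illustration of the difference (a sketch only, using Python's standard `array` module rather than Arrow; `rows`, `columns` and both functions are made-up names):

```python
from array import array

# Row-oriented access: one tuple per row, one lookup per cell.
rows = [(1, 10.0), (2, 20.5), (3, 30.25)]

def sum_row_by_row(rows, col):
    total = 0.0
    for row in rows:  # one iteration and one cell lookup per row
        total += row[col]
    return total

# Columnar layout: each column is a single contiguous, typed buffer.
columns = {
    "id": array("q", [1, 2, 3]),
    "price": array("d", [10.0, 20.5, 30.25]),
}

def sum_column(columns, name):
    return sum(columns[name])  # a single pass over one buffer

# A consumer can look at the column's bytes without copying the values.
view = memoryview(columns["price"])
```

With the columnar layout, the whole `price` column can be handed over as one buffer instead of being rebuilt cell by cell on the other side.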

How does Dremio differ from PrestoDB? As far as I know, PrestoDB can also virtualize access to many data sources and join data between them. We didn't go deep with PrestoDB because our basic tests for multi-source joins ran very slowly, and it seemed to pull all data from both joined tables into one place. I'm not a PrestoDB expert, so maybe there's a better way to do it (all suggestions welcome).

What's the differentiator? Is Dremio smarter somehow, so that it avoids copying all the data to perform a simple join? Or does it copy the data the same way, but Arrow lets it be faster than Presto? What's on your roadmap?

PrestoDB is similar to Impala, Hive and other SQL engines. Each is designed to do distributed SQL processing. Dremio does embed an OSS distributed SQL processing engine as well (Sabot, built natively on Arrow), but we see that as only a means to an end. Our focus is much more on being a BI & data fabric/service.

At the core of this vision are: very advanced pushdowns (far beyond other OSS systems), a powerful self-service UI for managing, curating and sharing data (designed for analysts, not just engineers) and--most importantly--the first open source implementation of distributed relational caching for all types of data. You can see more details about this last part in a deck I presented at DataEngConf earlier today: https://www.slideshare.net/dremio/using-apache-arrow-calcite...

Thank you very much for a thorough response. I think I would be happy with a library without SQL support, as long as filtering and grouping are supported. Seems like that would be Sabot :) Maybe one day I'll be able to use it.

For that level of operations, the Arrow library itself will probably have something fairly soon.

How does Dremio compare to the non-open-source Querona? Is it a similar service to Dremio's?

Dremio is focused on a combination of data access, acceleration and a self-service analyst experience (Tableau/Qlik one-click integration, data curation and data management).

We also invest heavily in our pushdowns. For example, we invested heavily in our Elasticsearch capabilities (support for CONTAINS and Lucene syntax, Painless and Groovy pushdowns, index leveraging, etc.) and I'm pretty comfortable saying we're the best in the world at exposing Elastic in a SQL context. Similarly for our other sources (join & window pushdowns in Oracle, SQL Server, etc.; high-speed predicate application in Hadoop; Mongo aggregation pipeline pushdowns, including support for unwind; etc.).
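
For anyone unfamiliar with the term, here is a toy sketch of why pushdowns matter. This is not how Dremio implements them; `ToySource` and the two query functions are hypothetical, and the only thing being shown is how many rows have to leave the source in each case.

```python
class ToySource:
    """Hypothetical data source that counts how many rows leave it."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0

    def scan(self, predicate=None):
        # With a pushed-down predicate, filtering happens inside the source.
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_sent += len(out)
        return out

def query_without_pushdown(source, predicate):
    # Pull every row out of the source, then filter in the engine.
    return [r for r in source.scan() if predicate(r)]

def query_with_pushdown(source, predicate):
    # Hand the predicate to the source so only matching rows are sent.
    return source.scan(predicate)

rows = [{"id": i, "status": "error" if i % 10 == 0 else "ok"}
        for i in range(100)]

def is_error(row):
    return row["status"] == "error"

plain = ToySource(rows)
res_plain = query_without_pushdown(plain, is_error)

pushed = ToySource(rows)
res_pushed = query_with_pushdown(pushed, is_error)
```

Both queries return the same rows, but `plain.rows_sent` counts every row in the source while `pushed.rows_sent` counts only the matches. Real pushdowns additionally translate the predicate into the source's own query language, which this sketch skips.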

I don't know Querona well (just a cursory site review). They seem much more focused on classic federation. It would also be important to understand where they repackage the Spark source connectors versus where they have built something that pushes down more powerfully (which you typically need for the best performance with these newer data source systems).