by mildbyte 1327 days ago
Just wanted to also give a shout out to Apache DataFusion[0] that IOx relies on a lot (and contributes to as well!).

It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
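To make "optimizer rules" a bit more concrete, here is a toy sketch in Python. It is not DataFusion's API (DataFusion is a Rust library with its own plan types and rule traits); the `Scan`, `Filter`, `Project`, `merge_filters` and `optimize` names are all made up for illustration. The only idea it shows is that a rule is a function from a query plan to an equivalent plan, and that an optimizer is a pipeline of such functions you can plug your own rules into.

```python
from dataclasses import dataclass


# Hypothetical, minimal logical plan nodes (not DataFusion's types).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    predicate: str
    input: "object"


@dataclass(frozen=True)
class Project:
    columns: tuple
    input: "object"


def merge_filters(plan):
    """Toy optimizer rule: collapse directly nested Filter nodes into one."""
    if isinstance(plan, Scan):
        return plan
    # Rewrite the subtree first, then look at the current node.
    child = merge_filters(plan.input)
    if isinstance(plan, Filter):
        if isinstance(child, Filter):
            # The inner predicate is evaluated first, so it goes on the left.
            merged = f"({child.predicate}) AND ({plan.predicate})"
            return Filter(merged, child.input)
        return Filter(plan.predicate, child)
    return Project(plan.columns, child)


def optimize(plan, rules):
    """Apply each rule in order; custom rules are just extra list entries."""
    for rule in rules:
        plan = rule(plan)
    return plan


plan = Project(("x",), Filter("x > 1", Filter("y < 5", Scan("t"))))
optimized = optimize(plan, [merge_filters])
```

A real engine works on typed expressions rather than predicate strings and checks that each rewrite preserves semantics, but the plan-in, plan-out shape is the same.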

It has very good single-node performance (there's even a way to compile it with SIMD support), and Ballista [1] extends it into a distributed query engine.

Plenty of other projects use it besides IOx, including VegaFusion, ROAPI and Cube.js's pre-aggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).

[0] https://github.com/apache/arrow-datafusion

[1] https://github.com/apache/arrow-ballista

[2] https://github.com/splitgraph/seafowl

2 comments

DataFusion is great, and we're happy to be contributing to it. We're also excited to see so many people around the world picking it up and contributing as well. For our development efforts on IOx, it's like a strong tailwind. But we also put a ton of effort into helping manage community efforts (thanks, alamb, our developer on IOx who is also on the Arrow PMC!).
Original author of DataFusion/Ballista here. Having alamb and others from InfluxData involved has been a huge help in driving the project forward and helping build an active community behind the project. It is genuinely hard to keep up with the momentum these days!
Hi, I just had a glance over the DataFusion project. Very interesting work, which I will definitely be keeping track of, but I've got a genuine question. Do you sometimes find development in Rust a little bit challenging for large-scale, performance-sensitive work?

I ask because I've noticed quite a few PRs fixing (large) performance regressions which, to my understanding, were mostly introduced by unforeseen or unexpected Rust compiler subtleties that led to less-than-optimal code generation. One example was a naive, simple-looking abstraction that brought performance down by something like 50% in TPC-H benchmarks. This really struck me, especially because it seems quite hard to identify the root cause, and I would like to hear about your experiences first-hand. Thanks a bunch!

Your initial experiments and decision to build on arrow-rs have been great for the project. Thank you and everyone involved.
> We're heavily using it to build Seafowl, an analytical database that's optimized for running SQL queries directly from the user's browser...

Interesting. Where does Seafowl fit in when I compare it with, say, a data-stack-in-a-box approach, e.g. meltano + dbt + duckdb + superset [0]? Is my thinking right that Seafowl possibly replaces both duckdb (with IOx) and superset (if there's a web front-end)?

Incidentally, dagster had an article up just yesterday making a case for a poor man's data lake with dbt + dagster + duckdb [1]. What does Splitgraph replace if I were to use it in a similar setup?

Thanks.

[0] https://archive.is/DxU1e

[1] https://archive.is/5ikU4

Great question! With Seafowl, the idea is different from what the modern data stack addresses. It's trying to simplify public-facing Web-based visualizations: apps that need to run analytical queries on large datasets and can be accessed by users all around the world. This is why we made the query API easily cacheable by CDNs and Seafowl itself easy to deploy at the edge, e.g. with Fly.io.
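As a rough illustration of what "easily cacheable by CDNs" can mean in practice, here is a minimal Python sketch. It assumes a scheme where a read-only query is sent as a GET request whose URL is derived from a hash of the query text; the `/q/<hash>` path, the `X-Query` header and the `cacheable_request` helper are hypothetical names for this example, not a description of Seafowl's actual API. Because the URL depends only on the query, every browser running the same query hits the same URL and a CDN can serve the cached response.

```python
import hashlib
from urllib.parse import quote


def cacheable_request(base_url: str, sql: str) -> tuple[str, dict[str, str]]:
    """Build a GET URL and headers for a read-only query (illustrative only)."""
    # Trim surrounding whitespace so trivially different inputs share a key.
    query = sql.strip()
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    # Hypothetical endpoint layout: the cache key is the URL itself.
    url = f"{base_url.rstrip('/')}/q/{digest}"
    # Hypothetical header carrying the query text, percent-encoded so it is
    # safe to put in an HTTP header.
    headers = {"X-Query": quote(query)}
    return url, headers


url, headers = cacheable_request("https://example.com", "SELECT 1")
```

The point of hashing rather than putting the SQL in the URL directly is that the URL stays short and fixed-length no matter how large the query is.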

It's a fairly different use case from DuckDB (query execution for Web applications vs fast embedded analytical database for notebooks) and the rest of the modern data stack (which mostly is about analytics internal to a company). Just to clarify, we're not related to IOx directly (only via us both using Apache DataFusion).

If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.

We don't currently ship a Web front-end and so can't serve as a replacement for Superset: Seafowl is exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime. Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or using a different viz library like d3/Vega).

Re: the second question about Splitgraph for a data lake, the intention behind Splitgraph is to orchestrate all those tools, and there the use case is indeed the modern data stack in a box. It's kind of similar to dbt Labs's Sinter [1], which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion and dbt transformations, be a data warehouse (using PostgreSQL and a columnar store extension), and let users organize and discover data, all in one place. There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.

Hope this helps!

[0] https://observablehq.com/@seafowl/interactive-visualization-...

[1] https://www.getdbt.com/blog/whats-in-a-name/

[2] https://fauna.com/