
by seddonm1 1800 days ago

Disclosure: I am a contributor to Datafusion.

I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.

Performance: In those early experiments, Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started, even on a local Spark instance. Obviously this is at smaller data sizes, but in my experience a lot of ETL is about repeatable processes, not necessarily huge datasets.

Compatibility: Those experiments were done a few months ago, and the SQL compatibility of the Datafusion engine has improved extremely rapidly since then (WINDOW functions were recently added). There is still some missing SQL functionality (for example, what is needed to run all the TPC-H queries: https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.

1 comments

eduren 1800 days ago

Oh hey, thanks for the info!

I spent some time evaluating Arc for my team's ETL purposes and I was really impressed. I hesitated somewhat to move forward with it because it seemed really tied into the Spark ecosystem (for great reasons). We just weren't at all familiar with deploying and operating Spark, so we ended up rolling our own scripts on top of (an existing) Airflow cluster for now.

Besides performance reasons, are there any other advantages to porting Arc to run on top of datafusion? If the porting effort was shared somewhere I'd love to dig in and see what the proof-of-concept looks like.

seddonm1 1799 days ago

Hi eduren. Give me a few days and I'll see what I can publish as a WIP repo. The aim of Arc was always to allow swapping the execution engine whilst retaining the logic (hence SQL), so this should hopefully be easy.
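
To make the "swap the engine, keep the logic" idea concrete, here is a minimal sketch. It is not Arc's or Datafusion's actual API: `Engine`, `SqliteEngine` and `run_job` are hypothetical names, and in-memory SQLite is only a stand-in for a real engine such as Spark or Datafusion. The point is that the transformation lives in a SQL string that knows nothing about the engine running it.

```python
import sqlite3
from typing import Protocol, Sequence

# The transformation logic is plain SQL, independent of any engine.
TRANSFORM_SQL = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY customer
"""

Row = tuple


class Engine(Protocol):
    """Hypothetical interface: anything that can load rows and run SQL."""

    def load(self, table: str, columns: Sequence[str], rows: Sequence[Row]) -> None: ...

    def query(self, sql: str) -> list[Row]: ...


class SqliteEngine:
    """Stand-in engine backed by in-memory SQLite (standard library only)."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")

    def load(self, table: str, columns: Sequence[str], rows: Sequence[Row]) -> None:
        cols = ", ".join(columns)
        marks = ", ".join("?" for _ in columns)
        self.conn.execute(f"CREATE TABLE {table} ({cols})")
        self.conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)

    def query(self, sql: str) -> list[Row]:
        return self.conn.execute(sql).fetchall()


def run_job(engine: Engine, sql: str) -> list[Row]:
    """Load a tiny sample dataset, then run the engine-agnostic SQL."""
    engine.load("orders", ["customer", "amount"], [("a", 10), ("b", 5), ("a", 7)])
    return engine.query(sql)


result = run_job(SqliteEngine(), TRANSFORM_SQL)
print(result)
```

Supporting another engine would mean writing one more class with the same two methods; `TRANSFORM_SQL` and `run_job` would stay unchanged, provided both engines accept the same SQL dialect.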

FridgeSeal 1800 days ago

Rust stuff tends to be a bit more resource efficient than Java.

Currently using DataFusion from Rust, and being more resource efficient means we can use smaller machines, which means our costs go down. Deploying services is also faster (smaller docker images, faster startup times) and puts less extraneous load on our machines.

I imagine Arc, and thus downstream users, would see similar benefits.
