Hacker News
by jdoliner 1422 days ago
I was at Airbnb when we open-sourced Airflow. It was a great solution to the problems we had at the time, and it's amazing how many more use cases people have found for it since then. At the time it was pretty focused on solving our problem of orchestrating a largely static DAG of SQL jobs. It could do other stuff even then, but that was mostly what we were using it for. Airflow has become a victim of its own success as it's expanded to meet every problem that could ever be considered a data workflow. The flaws and horror stories in the post and comments here definitely resonate with me.

Around the time Airflow was open-sourced I started working on a data-centric approach to workflow management called Pachyderm[0]. By data-centric I mean that it's focused on the data itself, and its storage, versioning, orchestration and lineage. This leads to a system that feels radically different from a job-focused system like Airflow. In a data-centric system your spaghetti nest of DAGs is greatly simplified, because the data itself is used to describe most of the complexity. The benefit is that data is a lot simpler to reason about: it's not a living thing that needs to run in a certain way, it just exists, and because it's versioned you have strong guarantees about how it can change.

[0] https://github.com/pachyderm/pachyderm

2 comments

i want to be able to trigger datasets to be rebuilt automatically when their dependencies change, which as i understand it is a large part of pachyderm's value proposition, but it is unclear how to integrate pachyderm into the larger data ecosystem. my users expect data to be available through a hive metastore or aws glue data catalog. they expect to be able to query it with aws athena, snowflake (as external tables), and other off the shelf tools. i need to be able to leverage apache iceberg (or delta lake or hudi etc) to incrementally update datasets that are costly to rebuild from scratch. it doesn't seem that pachyderm can do any of these things, but maybe i am just missing how it would do them? i would love to have a scheduler that is just responsible for triggering datasets to update when their dependencies change, but it seems that pachyderm is built around a closed ecosystem which makes it incompatible with tools outside that ecosystem.
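To make "trigger datasets to update when their dependencies change" concrete, here is a minimal sketch in plain Python. It is not Pachyderm's API or anyone's actual implementation. The function `stale_datasets`, the dataset names, and the version strings are all hypothetical, and it assumes every dataset carries some version id and that each derived dataset records the upstream versions it was last built against.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stale_datasets(deps, current, built_from):
    """Return derived datasets that need a rebuild, upstream first.

    deps:       derived dataset -> list of its upstream datasets
    current:    dataset -> current version id
    built_from: derived dataset -> {upstream: version used in the last build}
    """
    stale = []
    # static_order() yields every dataset after all of its upstreams.
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, ())
        if not upstream:
            continue  # source dataset: nothing to rebuild
        last_seen = built_from.get(name, {})
        # Stale if an upstream is itself stale, or its version has moved on.
        changed = any(
            u in stale or last_seen.get(u) != current.get(u) for u in upstream
        )
        if changed:
            stale.append(name)
    return stale


# Hypothetical example: raw_events moved from v1 to v2 since the last build.
deps = {
    "clean_events": ["raw_events"],
    "daily_report": ["clean_events", "users"],
}
current = {"raw_events": "v2", "users": "v1", "clean_events": "v1"}
built_from = {
    "clean_events": {"raw_events": "v1"},
    "daily_report": {"clean_events": "v1", "users": "v1"},
}
print(stale_datasets(deps, current, built_from))
# prints ['clean_events', 'daily_report']
```

A scheduler built on this idea would only decide what to rebuild and in which order. How each rebuild runs, for example an incremental Iceberg update versus a full recompute, would stay with whatever tool owns that dataset.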
Cool. Are there any published benchmarks on how the data versioning engine scales?
We are doing a whole bunch of performance testing on the new 2.0 engine that we released at the end of 2021. We'll be publishing the results.
The end of 2021?
Yes, Pachyderm 2.0 was released last year.