by Maro 2849 days ago
We base our whole DS infrastructure on Airflow (and Superset):

http://bytepawn.com/fetchr-airflow.html

http://bytepawn.com/fetchr-data-science-infra.html

Airflow is somewhere between good enough and pretty cool. It's based on what we had at Facebook (called Dataswarm).

IMO in 2-3 years Airflow will be the de-facto ETL standard, like Hadoop used to be for "Big data". If you're rolling your own ETL at this point, you're wasting your time. If you're using something else, you're (probably) missing out on ETL-as-code goodness.

3 comments

IMO Airflow currently is the de-facto standard, and in 2-3 years it will go the way of Hadoop.
How does it compare to big-iron enterprise ETL tools like IBM DataStage? I have only dabbled, but it looks far more appealing for a variety of reasons.
For context, I used DataStage, Informatica, Ab Initio and SSIS in previous lives and went on to write the first version of Airflow. I developed a taste for pipelines-as-code while working at Facebook using an internal tool that is not open source.

I'd argue that pipelines as code, as opposed to drag-and-drop GUIs, is a better approach, at least for people who are comfortable writing code. Code is easy to version, test, diff, and collaborate on, and it allows for the creation of arbitrary abstractions.
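
To make the pipelines-as-code point concrete, here is a minimal sketch in plain Python. It is not Airflow's API: the task names, the `PIPELINE` dict and the `run` helper are all made up for illustration, and the only library used is the standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-step pipeline; names and data are illustrative only.
def extract():
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]

def transform(rows):
    # Cast the string amounts to floats.
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows):
    # Stand-in for a real sink: just return the total amount.
    return sum(r["amount"] for r in rows)

# The pipeline definition is ordinary data: task name -> (callable, upstream tasks).
PIPELINE = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}

def run(pipeline):
    # Order tasks so that every upstream task runs before its dependents.
    graph = {name: deps for name, (_, deps) in pipeline.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        # Pass each upstream result as a positional argument.
        results[name] = func(*(results[d] for d in deps))
    return results

if __name__ == "__main__":
    print(run(PIPELINE)["load"])
```

Because the whole definition is a Python file, a change to a dependency shows up as a one-line diff, and something like `transform` can be unit tested on its own without running the rest of the pipeline.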

ETL tools just can't compete with a tool that requires code for everything. It might seem backwards, but we've abandoned all non-code environments and require pure code for the configuration of all of our pipelines.
> like Hadoop used to be for "Big data"

So, you are saying it will fade away over time, after it becomes the de-facto ETL tool?