by DSpinellis 259 days ago
Apache Airflow solves a very different problem. Its DAGs are static dependencies between sequentially executed processing steps, whereas the DAGs of dgsh express live direct data flows.
2 comments

Yeah, there are also the boxes and lines tools like

https://www.knime.com/

which have their own subculture. You could solve the same problems they do with pandas and scikit-learn but people who use those tools would never use pandas and scikit-learn and vice versa.

Circa 2015 I thought those tools all shared an architectural flaw: they pass relational rows over the lines rather than JSON objects (or an equivalent). That means you have to realize joins as highly complex graphs, where things that seem like local concerns to me require a global structure, and where what seems like a small change to management changes the whole graph in a big way.
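To make the rows-versus-objects point concrete, here is a minimal sketch in Python. It is only one reading of the argument, and the order/item data and the names join_rows and total_qty are made up for illustration. With flat rows, the one-to-many relation is split across two streams and the graph needs an explicit join node; with a nested object, the related items travel together on a single line.

```python
# Hypothetical data: one order with two line items.

# Row style: the relation is spread over two row streams, so a
# downstream node has to join them on order_id.
order_rows = [{"order_id": 1, "customer": "acme"}]
item_rows = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
]


def join_rows(orders, items):
    """The explicit join step that a row-based graph needs as its own node."""
    items_by_order = {}
    for item in items:
        items_by_order.setdefault(item["order_id"], []).append(
            {"sku": item["sku"], "qty": item["qty"]}
        )
    return [
        {**order, "items": items_by_order.get(order["order_id"], [])}
        for order in orders
    ]


# Object style: the nested record already carries its items, so a
# per-order computation stays a local concern of one node.
order_doc = {
    "order_id": 1,
    "customer": "acme",
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}


def total_qty(doc):
    """Sum the quantities of one order without consulting any other stream."""
    return sum(item["qty"] for item in doc["items"])


if __name__ == "__main__":
    print(join_rows(order_rows, item_rows) == [order_doc])
    print(total_qty(order_doc))
```

In the row version, adding another related stream means adding another join node and rewiring its neighbours; in the object version it would just be another nested field.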

I found that the people who were buying up those sorts of tools didn’t give a damn, because they thought customers demanded the speed of columnar execution, which our way couldn’t deliver.

I made a prototype that gave the right answers every time, and then went to work for a place that had some luck selling its own version, which didn’t always give the right answers because they didn’t know what algebra it supported, didn’t believe something like that had an algebra, and didn’t properly tear the pipeline down at the end.

Do you mean to say that two non-dependent tasks in an Airflow DAG aren't able to execute concurrently? That's not my experience. I'm also confused by the use of 'static' in this context.
That's the point: non-dependent tasks can run concurrently in Airflow. In sh/Bash/dgsh, dependent tasks can also run concurrently, as in tar cf - . | xz.
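A rough way to see this for yourself, sketched in Python rather than shell: the script below starts a producer process (a hypothetical stand-in for tar cf - .) and consumes its output through a pipe, checking whether data arrives while the producer is still running. The producer script, the chunk names, and the function run_pipeline are all invented for this illustration; the 0.2 s sleep is only there to make the overlap observable.

```python
import subprocess
import sys

# Producer: emits three lines, pausing after each one and flushing so
# every line enters the pipe immediately (stand-in for `tar cf - .`).
PRODUCER = (
    "import time\n"
    "for i in range(3):\n"
    "    print(f'chunk {i}', flush=True)\n"
    "    time.sleep(0.2)\n"
)


def run_pipeline():
    """Run the producer and consume its output through a pipe.

    Returns (lines, overlapped); overlapped is True if the consumer
    received data while the producer had not yet exited.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
        text=True,
    )
    lines = []
    overlapped = False
    for line in proc.stdout:
        # poll() returns None while the producer is still running.
        if proc.poll() is None:
            overlapped = True
        lines.append(line.strip())
    proc.wait()
    return lines, overlapped


if __name__ == "__main__":
    lines, overlapped = run_pipeline()
    print(lines)
    print("consumer ran while producer was alive:", overlapped)
```

The consumer depends on the producer's output, yet it starts processing before the producer has finished, which is the behaviour a shell pipe gives you and a step-by-step task scheduler does not.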
Ok. thank you!