

by roenxi 729 days ago

No, unfortunately those factors are all very related.

Once you have GUI ETL tools, in my experience, you can't modularise because the ETL tool makes assumptions about where the boundaries are that differ from what suits the domain in question. Observability falls over because you're now limited to what the ETL tool exposes instead of what the domain needs. Scale suffers because the ETL data model now needs to be preserved, and high-performance tricks might require working around the entire tool, and so on.

Code is the highest-performance environment we have for working with huge complex systems made of if statements and loops. Giving that up to go to a tool doesn't actually yield any advantages; there needs to be an abstraction with huge practical benefits and a DAG isn't it. Modeling a DAG in a true programming language isn't hard enough to justify moving away from an IDE.
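
To make that concrete, here is a minimal sketch of a DAG modeled in plain Python with the standard library's graphlib (Python 3.9+). The task names, the sample rows, and the run helper are all made up for illustration; a real pipeline would add retries, logging, and so on, but those are ordinary code too.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: plain functions, no framework required.
def extract():
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]

def transform(rows):
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows):
    # Stand-in for a real sink: just total the amounts.
    return sum(r["amount"] for r in rows)

# The DAG: task name -> (function, names of upstream tasks).
PIPELINE = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}

def run(pipeline):
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        # Pass each upstream result as a positional argument.
        results[name] = func(*(results[d] for d in deps))
    return results

results = run(PIPELINE)
```

Because the boundaries are just functions and a dict, they can sit wherever the domain wants them, and any debugger or profiler works on them unchanged.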

An ETL pipeline in practice is still uncomfortably close to a big spaghetti of if-thens and loops, and tooling and extra models create patterns that often block a lot of the useful properties you list. The real gains come from not writing a custom scheduler, but splitting out the valuable scheduler from the ETL tool means that you have a scheduler, not an ETL tool. Sometimes there is an ecosystem of adaptors that makes a big difference, but if that doesn't meet your engineering requirements then the tool is useless (because you don't have any real levers to pull on the scale, observability, modularity, and maintainability fronts).