by dalailambda 2158 days ago
A quote from the article I would object to is "for large datasets and complex transformations this architecture is far from ideal. This is far from the world of open-source code on Git & CI/CD that data engineering offers - again locking you into proprietary formats, and archaic development processes."

No one is forcing you to use those tools on top of something like Snowflake (which is just a SQL interface). These days we have great open source tools (such as https://www.getdbt.com/) which let you write plain SQL that you can then deploy to multiple environments, test and deploy automatically, and script around. At the same time, dealing with large datasets in a Spark world is full of lower-level details, whereas in a SQL database it's the exact same query you would run on a smaller dataset.

The reality is that the ETL model is fading in favour of ELT (load data, then transform it in the warehouse) because maintaining complex data pipelines and Spark clusters makes little sense when you can spin up a cloud data warehouse. In this world we don't just need less developer time; those developers also don't have to be engineers who can write and maintain Spark workloads and clusters. They can be analysts who are able to do transformations and get something valuable out to the business faster than the equivalent Spark data pipeline can be built.
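
To make the ELT point concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 as a stand-in for a cloud warehouse, and the table and column names (raw_orders, customer_revenue) are made up for illustration. The raw rows are loaded untouched, and the transformation is plain SQL that runs inside the database:

```python
import sqlite3

# sqlite3 is only a stand-in for a cloud data warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records as-is, with no transformation on the way in.
conn.execute(
    "CREATE TABLE raw_orders ("
    "order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
raw_rows = [
    (1, "alice", 1200, "paid"),
    (2, "bob", 550, "refunded"),
    (3, "alice", 800, "paid"),
    (4, "carol", 300, "paid"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: plain SQL executed in the warehouse itself.
# The same statement would be used whether raw_orders holds 4 rows or billions.
TRANSFORM_SQL = """
CREATE TABLE customer_revenue AS
SELECT customer, SUM(amount_cents) AS revenue_cents
FROM raw_orders
WHERE status = 'paid'
GROUP BY customer
"""
conn.execute(TRANSFORM_SQL)

result = conn.execute(
    "SELECT customer, revenue_cents FROM customer_revenue ORDER BY customer"
).fetchall()
print(result)
```

A tool like dbt would keep the SELECT in a version-controlled file and handle environments and tests around it; that part is not shown here.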

1 comment

Very valid points: 1) Agree that Snowflake is far easier to use than Spark. 2) Agree that DBT is a great tool.

The context here is ETL workflows that typically process tens of TBs, and workflows with large, complex business logic. With Spark code, you can break your code down into smaller pieces, see the data flow across them, write unit tests, and still have the entire thing execute as a single SQL query.
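
A rough sketch of what "smaller pieces with unit tests" means. This is plain Python over lists of dicts, not the Spark API, so it runs without a cluster; in Spark each step would typically take and return a DataFrame instead. All function and field names are hypothetical:

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, object]

# Each step is small, named, and testable on its own.
def only_paid(rows: Iterable[Row]) -> Iterator[Row]:
    return (r for r in rows if r["status"] == "paid")

def add_amount_dollars(rows: Iterable[Row]) -> Iterator[Row]:
    for r in rows:
        yield {**r, "amount_dollars": r["amount_cents"] / 100}

def revenue_by_customer(rows: Iterable[Row]) -> Dict[str, float]:
    totals: Dict[str, float] = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount_dollars"]
    return totals

# The full pipeline is just the composition of the steps.
def pipeline(rows: Iterable[Row]) -> Dict[str, float]:
    return revenue_by_customer(add_amount_dollars(only_paid(rows)))

SAMPLE = [
    {"customer": "alice", "amount_cents": 1200, "status": "paid"},
    {"customer": "bob", "amount_cents": 550, "status": "refunded"},
    {"customer": "alice", "amount_cents": 800, "status": "paid"},
    {"customer": "carol", "amount_cents": 300, "status": "paid"},
]

# Unit tests target one step at a time, so a failure points at one piece of logic.
def test_only_paid() -> None:
    assert [r["customer"] for r in only_paid(SAMPLE)] == ["alice", "alice", "carol"]

def test_add_amount_dollars() -> None:
    out = list(add_amount_dollars([{"amount_cents": 250}]))
    assert out == [{"amount_cents": 250, "amount_dollars": 2.5}]

test_only_paid()
test_add_amount_dollars()
print(pipeline(SAMPLE))
```

The intermediate output of any step can be inspected by calling that step directly, without editing the pipeline. The generator steps are lazy, which is only a loose analogy for how Spark defers execution until the whole plan is known.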

Don't large SQL scripts become really gnarly for complex stuff, nothing short of magical incantations? I can't see the data flowing out of a subquery for debugging without changing the code.

Prophecy as a company is focused on making Spark significantly easier to use!