by azurezyq 1328 days ago
This is actually a great observation. Data pipelines are often written in various languages, running on heterogeneous systems, with different time alignment schemes. I've always found it tricky to "fully trust" a result. Hmm, any best practices from your side?
2 comments

Without getting into the weeds of it, I'd say smooth out the rough edges in your development experience and make it behave as much like prod as possible. If there's less friction, there's less incentive to cut corners and make hacks imo.

Some pain points:

- Does it take forever to spin up infra to run a single test?

- Is grabbing test data a manual process? This can be a huge pain, especially if the test data is binary like Avro or Parquet. Test inputs and results should be human friendly.

- Does setting up a testing environment require filling out tons of yaml files and manual steps?

- Things built at the wrong level of abstraction! This always irks me. Keep the abstractions clean about which tool in your data stack does what. Things get confusing when people start inlining task-specific logic at the DAG level in Airflow, or let individual tasks make their own triggering or scheduling decisions.
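
To make the "human friendly test data" point concrete, here is a minimal sketch in Python. The `add_total` transform, the field names, and the fixture contents are all hypothetical, made up for illustration. The idea is just that the test input lives as readable JSON Lines text next to the test instead of in a binary file, and that the logic under test is a pure function that needs no infra to run.

```python
import json

# Hypothetical transform under test: a pure function with no I/O,
# so it can run locally without spinning up any infrastructure.
def add_total(record: dict) -> dict:
    return {**record, "total": record["price"] * record["quantity"]}

# Test input kept as JSON Lines text rather than a binary Avro/Parquet file,
# so a reviewer can read and edit it directly in the test.
FIXTURE = """\
{"order_id": 1, "price": 2.5, "quantity": 4}
{"order_id": 2, "price": 10.0, "quantity": 1}
"""

def load_fixture(text: str) -> list[dict]:
    # One JSON object per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def run_pipeline(records: list[dict]) -> list[dict]:
    return [add_total(r) for r in records]

results = run_pipeline(load_fixture(FIXTURE))
```

If the real pipeline reads Avro or Parquet, the conversion can sit at the edge of the job, so the logic in the middle is still testable against text fixtures like this.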

Right now my workflow allows me to run a prod job (Google Cloud Dataflow) from my local machine. It consumes prod data and writes to a test-prefixed path. With unit tests on the Scala code, a successful run of the Dataflow job, and validation and metrics on the prod job, I can feel pretty comfortable with the correctness of the pipeline.
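
As a rough illustration of the "reads prod data, writes to a test-prefixed path" idea, something like the following could work. This is a sketch in Python rather than the Scala used in the actual job, and `with_test_prefix`, `resolve_output`, and the bucket and path names are hypothetical, not part of any Dataflow API.

```python
def with_test_prefix(path: str, prefix: str = "test") -> str:
    """Insert a prefix directly after the bucket of a scheme://bucket/key path."""
    scheme, sep, rest = path.partition("://")
    if not sep:
        raise ValueError(f"expected a scheme://bucket/key path, got {path!r}")
    bucket, _, key = rest.partition("/")
    return f"{scheme}://{bucket}/{prefix}/{key}"

def resolve_output(path: str, test_run: bool) -> str:
    # Local or ad hoc runs are redirected so they can never overwrite prod output.
    return with_test_prefix(path) if test_run else path

# Hypothetical example paths.
prod_output = resolve_output("gs://my-bucket/orders/daily", test_run=False)
test_output = resolve_output("gs://my-bucket/orders/daily", test_run=True)
```

The input path stays untouched, so a test run exercises the same data as prod and only the destination differs.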

Not OP, but a Data Engineer with 4 years of experience in the space. I think the key is to first build the feedback loop, i.e. anything that helps you answer "how do you know the data pipeline is flowing and that the data is correct?", and then get sign-off from both the producers and consumers of the data. Actually getting the data flowing is usually pretty easy once both parties agree on what that actually means.
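
One way to picture that feedback loop is a small check that encodes what producers and consumers agreed on. This is only a sketch: `check_batch`, the `event_time` field, and the thresholds are assumptions for illustration, and a real setup would likely report these as metrics or alerts instead of returning a list.

```python
from datetime import datetime, timedelta, timezone

def check_batch(records, required_fields, min_rows, max_age, now):
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems = []

    # Is data flowing at all?
    if len(records) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(records)}")

    # Is the data correct, per the agreed contract on required fields?
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) is None)
        if missing:
            problems.append(f"{missing} rows missing '{field}'")

    # Is the data fresh? Assumes each record carries an 'event_time' datetime.
    times = [r["event_time"] for r in records if r.get("event_time") is not None]
    if times and now - max(times) > max_age:
        problems.append("newest record is older than the agreed freshness window")

    return problems

# Hypothetical usage with a single fresh, complete record.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
fresh_batch = [{"id": 1, "event_time": now - timedelta(minutes=5)}]
fresh_problems = check_batch(fresh_batch, ["id"], 1, timedelta(hours=1), now)
```

The useful part is less the code than the arguments: row counts, required fields, and the freshness window are the things both sides have to sign off on.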