|
|
|
|
|
by prions
1327 days ago
|
|
IMO Data engineering is already a specialized form of software engineering. However what people interpret as DE's being slow to adopt best practices from traditional software engineering is more about the unique difficulties of working with data (especially at scale) and less about the awareness or desire to use best practices. Speaking from my DE experience at Spotify and previously in startup land, the biggest challenge is the slow and distant feedback loop. The vast majority of data pipelines don't run on your machine and don't behave like they do on a local machine. They run as massively distributed processes and their state is opaque to the developer. Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far. And integration testing really needs to work at scale with easily recyclable infrastructure (and data) to not be a massive drag on developer productivity. Even getting the correct kind of data to be fed into a test can be very difficult if the ops/infra of the org isn't designed for it. The best data tooling isn't going to look exactly like traditional swe tooling. Tools that vastly reduce the feedback loop of developing (and debugging) distributed pipelines running in the cloud and also provide means of validating the output on meaningful data is where tooling should be going. Trying to shoehorn traditional SWE best practices will really only take off once that kind of developer experience is realized. |
|
I'm glad to see someone calling this out because the comment here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part.
This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide some sort of ongoing monitoring for confidence on the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I will generally run some rudimentary statistics to check to see if the volume and types of data I'm getting is similar to what is expected.