|
|
|
|
|
by mywittyname
1341 days ago
|
|
> Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far. I'm glad to see someone calling this out because the comment here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part. This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide some sort of ongoing monitoring for confidence on the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I will generally run some rudimentary statistics to check to see if the volume and types of data I'm getting is similar to what is expected. |
|
They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.
Datafold, Metaplane, Validio, Monte Carlo, Bigeye
Great Expectations has always been an open source standby as well and is being turned into a product.