by tech_ken
426 days ago
The Git-like approach to data versioning seems really promising to me, but I'm wondering what those merge operations are expected to look like in practice. In a coding environment, I'd review the PR basically line by line to check for code quality, engineering soundness, etc. In the data case it's not clear to me that a line-by-line review would be possible, or even useful, and I'm also curious what tooling (if any) is provided to support it.

For example: I saw the YouTube video demo someone linked here, which showed a quarterly report pipeline. Say I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Say that in their data branch, the topline report numbers differ from `main` by X%. Clearly that's due to some change in the pipeline, but it seems like I'd still have to fire up a notebook and copy+paste chunks of the pipeline to see, step by step, where things diverge. Is there another recommended workflow (or even better: provided tooling) for determining which deltas in the pipeline contributed to the X% difference?
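To make the manual workflow concrete, here is a minimal sketch of the step-by-step comparison described above. It is plain Python, not any tool's API: the stage functions, the `first_divergent_stage` helper, and the sample rows are all hypothetical, and it assumes both branches expose the same ordered list of stages.

```python
def first_divergent_stage(stages_main, stages_branch, rows, metric, tol=1e-9):
    """Run both pipeline variants stage by stage on the same input.

    Returns (stage_name, main_value, branch_value) for the first stage whose
    metric differs by more than `tol`, or None if the variants never diverge.
    """
    rows_main, rows_branch = rows, rows
    for (name, f_main), (_, f_branch) in zip(stages_main, stages_branch):
        rows_main = f_main(rows_main)
        rows_branch = f_branch(rows_branch)
        m, b = metric(rows_main), metric(rows_branch)
        if abs(m - b) > tol:
            return name, m, b
    return None


# Hypothetical topline metric and pipeline stages, for illustration only.
def total_revenue(rows):
    return sum(r["revenue"] for r in rows)

def drop_refunds(rows):  # `main` filters out negative revenue rows
    return [r for r in rows if r["revenue"] > 0]

def keep_all(rows):      # the branch removed that filter
    return list(rows)

def apply_fx(rows):      # identical in both variants
    return [{**r, "revenue": r["revenue"] * 1.0} for r in rows]


rows = [{"revenue": 100.0}, {"revenue": -20.0}, {"revenue": 50.0}]
main = [("clean", drop_refunds), ("fx", apply_fx)]
branch = [("clean", keep_all), ("fx", apply_fx)]

result = first_divergent_stage(main, branch, rows, total_revenue)
print(result)
```

This only locates the first stage where the metric diverges; it does not attribute the X% across several interacting changes, which is the part I'd really like tooling for.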
|
One thing we do have is support for “expectations”: model-like Python steps that check data quality and can flag when the pipeline's output violates them.
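As a rough illustration of the idea (plain Python, not the product's actual API; every function name and value below is hypothetical), an expectation can be thought of as a step that takes a pipeline's output and returns pass or fail:

```python
def expect_no_nulls(rows, column):
    """Pass if no row has a missing value in `column`."""
    return all(row.get(column) is not None for row in rows)


def expect_total_within(rows, column, baseline, max_rel_change):
    """Pass if the column total is within `max_rel_change` of `baseline`."""
    total = sum(row[column] for row in rows)
    return abs(total - baseline) <= max_rel_change * abs(baseline)


def failed_expectations(checks):
    """Return the names of the checks that did not pass."""
    return [name for name, passed in checks.items() if not passed]


# Hypothetical report output from a data branch, plus the topline from `main`.
report = [
    {"region": "EU", "revenue": 120.0},
    {"region": "US", "revenue": 95.0},
]
main_topline = 200.0

checks = {
    "revenue_not_null": expect_no_nulls(report, "revenue"),
    "total_within_5pct_of_main": expect_total_within(
        report, "revenue", baseline=main_topline, max_rel_change=0.05
    ),
}

failures = failed_expectations(checks)
print(failures)
```

A check like the second one would flag the branch before merge if its topline drifts too far from `main`, though it would not by itself explain which step caused the drift.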