| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by tech_ken 426 days ago
	The Git-like approach to data versioning seems really promising to me, but I'm wondering what those merge operations are expected to look like in practice. In a coding environment, I'd review the PR basically line-by-line to check for code quality, engineering soundness, etc. But in the data case it's not clear to me that a line-by-line review would be possible, or even useful; and I'm also curious about what (if any) tooling is provided to support it? For example: I saw the YouTube video demo someone linked here where they had an example of a quarterly report pipeline. Say that I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Say in their data branch, the topline report numbers are different from `main` by X%. Clearly it's due to some change in the pipeline, but it seems like I will still have to fire up a notebook and copy+paste chunks of the pipeline to see step-by-step where things are different. Is there another recommended workflow (or even better: provided tooling) for determining which deltas in the pipeline contributed to the X% difference?

1 comments

zenlikethat 426 days ago

That’s a great question. Diffing is one area we’ve thought a bit about but still need to dedicate more cycles to. One thing I would be curious about is, what are you doing in these notebooks to check? For what it’s worth, could possibly have an intermediate Python model that does some calculation to look at differences and materializes the results to a table, which you could then query directly for further insight.

One thing we do have support for “expectations” — model-like Python steps that check data quality, and can flag it if the pipeline violates them.

link

tech_ken 426 days ago

> For what it’s worth, could possibly have an intermediate Python model that...materializes the results to a table

I think this is kind of the answer I was looking for, and in other systems I've actually manually implemented things like this with a "temp materialize" operator that's enabled by a "debug_run=True" flag. With the NB thing basically I'm trying to "step inside" the data pipeline, like how an IDE debugger might run a script line by line until an error hits and then drops you into a REPL located 'within' the code state. In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes. I think the dream for me would be to having something that can say "the delta on line N is responsible for X% of the variance in the output", although I recognize that's probably not a well defined calculation in many cases. But either way at a high level my goal is to understand why my data changes, so I can be confident that those changes are legit and not an artifact of some error in the pipeline.

Asserting that a set of expectations is met at multiple pipeline stages also gets pretty close, although I do think it's not entirely the same. Seems loosely analogous to the difference between unit and integration/E2E tests. Obviously I'm not going to land something with failing unit tests, but even if tests are passing the delta may include more subtle (logical) changes which violate the assumptions of my users or integrated systems (ex. that their understanding the business logic is aligned with what was implemented in the pipeline).

link

jtagliabuetooso 426 days ago

"In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes."

You can automate many changes / tests by materializing the parent(s) of the target table, and use the SDK to produce variations of a pipeline programmatically. If your pipeline has a free parameter (say top-k=5 for some algos), you could just write a Python for loop, and do something like:

client.create_branch() client.run()

for each variation, materializing k versions at the end that you can inspect (client.query("SELECT MAX ...")

The broader concept is that every operation in the lake is immutably stored with an ID, so every run can be replicated with the exact same data sources and the exact same code (even if not committed to GHub), which also means you can run the same code varying the data source, or run a different code on the same data: all zero-copy, all in production.

As for the semantics of merge and other conflicts, we will be publishing by end of summer some new research: look out for a new blog post and paper if you like this space!

link

tech_ken 425 days ago

Wow that's super cool, thanks for explaining! I will definitely keep an eye out; having that level of certifiability and replayability at a pipeline level is something my team has been having too much angst about. It would be incredible to be able to abstract it away as cleanly as we do code management.

link

jtagliabuetooso 425 days ago

Awesome, you can write me anytime to geek out (jacopo.tagliabue@bauplanlabs.com) or follow us for more community sharing (papers, deep tech blog posts: https://www.linkedin.com/in/jacopotagliabue/).

The full auditability is already here today though, so we are always happy to hear your feedback on our public sandbox (you can join for free from our website, and reach out to us at anytime for good and bad feedback!)

link