by jtagliabuetooso
425 days ago
"In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes."

You can automate many of these changes and tests by materializing the parent(s) of the target table and using the SDK to produce variations of the pipeline programmatically. If your pipeline has a free parameter (say top-k=5 for some algorithm), you could write a Python for loop that calls client.create_branch() and client.run() for each variation, materializing k versions at the end that you can inspect (client.query("SELECT MAX ...")).

The broader concept is that every operation in the lake is immutably stored with an ID, so every run can be replicated with the exact same data sources and the exact same code (even if it was never committed to GitHub). That also means you can run the same code on a different data source, or different code on the same data: all zero-copy, all in production.

As for the semantics of merge and other conflicts, we will be publishing some new research by the end of summer: look out for a new blog post and paper if you like this space!
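A rough sketch of that loop, in case it helps. Only the three method names (create_branch, run, query) come from the comment above; the argument names, the branch naming scheme, the "top_k" parameter key, the example query, and the InMemoryClient stand-in are all assumptions made so the snippet runs on its own. Check the real SDK for the actual signatures.

```python
from typing import Any, Dict, Iterable, List, Protocol


class LakeClient(Protocol):
    """Hypothetical interface: method names from the comment, arguments assumed."""

    def create_branch(self, branch: str, from_ref: str) -> None: ...
    def run(self, ref: str, parameters: Dict[str, Any]) -> str: ...
    def query(self, query: str, ref: str) -> List[tuple]: ...


def sweep_top_k(
    client: LakeClient,
    ks: Iterable[int],
    inspect_sql: str,
    base_ref: str = "main",
    prefix: str = "sweep",
) -> Dict[int, Dict[str, Any]]:
    """Run one pipeline variation per value of k, each on its own branch."""
    results: Dict[int, Dict[str, Any]] = {}
    for k in ks:
        branch = f"{prefix}_top_k_{k}"  # assumed naming scheme
        # Branch off the base ref so each variation sees the same parent data.
        client.create_branch(branch, from_ref=base_ref)
        # Run the pipeline on that branch with this value of the free parameter.
        run_id = client.run(ref=branch, parameters={"top_k": k})
        # Inspect the materialized result on the branch.
        rows = client.query(inspect_sql, ref=branch)
        results[k] = {"branch": branch, "run_id": run_id, "rows": rows}
    return results


class InMemoryClient:
    """Toy stand-in (not a real SDK) so the sketch can be executed locally."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, Any]] = {"main": {}}
        self._runs = 0

    def create_branch(self, branch: str, from_ref: str) -> None:
        # A real lake would do this zero-copy; here it is a shallow dict copy.
        self.branches[branch] = dict(self.branches[from_ref])

    def run(self, ref: str, parameters: Dict[str, Any]) -> str:
        self._runs += 1
        self.branches[ref]["top_k"] = parameters["top_k"]
        return f"run-{self._runs}"

    def query(self, query: str, ref: str) -> List[tuple]:
        # Ignores the SQL and just echoes the branch state.
        return [(self.branches[ref]["top_k"],)]


if __name__ == "__main__":
    demo = InMemoryClient()
    # Hypothetical query text; the comment leaves the real one elided.
    out = sweep_top_k(demo, [3, 5, 10], "SELECT MAX(score) FROM target_table")
    for k, info in out.items():
        print(k, info["branch"], info["run_id"], info["rows"])
```

Each variation lives on its own branch, so the base ref is never touched and you can compare the k results side by side before deciding what, if anything, to merge.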