by verdverm
2250 days ago
https://dolthub.com is the cool kid right now. There are also Pachyderm, Git LFS, and IPFS. What we really need is version control for data; it's not just an ML data problem. It's a little different, though, because you would like to move computation to the data rather than the other way around.
It seems to me that to time-travel in data you almost need to store the write-ahead log of database transactions and be able to replay it. Debezium captures the CDC information, but it's an infrastructure-level tool rather than a version control tool.
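To make the replay idea concrete, here is a rough sketch of rebuilding a table's state at any point from an ordered change log. The `Change` record and the `replay` function are made up for illustration; this is not Debezium's event format or any real database's WAL layout, just a toy key-value log.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One hypothetical log entry; real WAL/CDC records carry far more detail."""
    seq: int                      # monotonically increasing log position
    op: str                       # "upsert" or "delete"
    key: str
    value: Optional[Any] = None


def replay(log: Iterable[Change], upto_seq: int) -> Dict[str, Any]:
    """Rebuild the state as it was after applying every change with seq <= upto_seq."""
    state: Dict[str, Any] = {}
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq > upto_seq:
            break
        if change.op == "upsert":
            state[change.key] = change.value
        elif change.op == "delete":
            state.pop(change.key, None)
        else:
            raise ValueError(f"unknown op: {change.op!r}")
    return state


LOG = [
    Change(1, "upsert", "a", 10),
    Change(2, "upsert", "b", 20),
    Change(3, "upsert", "a", 12),
    Change(4, "delete", "b"),
]

state_at_2 = replay(LOG, 2)   # state before "a" was updated
state_at_4 = replay(LOG, 4)   # state after "b" was deleted
```

Replaying from the start gets slow on a long log, so a real system would presumably snapshot periodically and replay only the tail.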
In data science, most time-travel issues are worked around using bitemporal data modeling, which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.
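A minimal sketch of the timestamp-column approach, using an in-memory SQLite database. The `prices` table, its columns, and the `as_of` helper are all assumptions for the sake of the example, and the table is append-only: every ETL run inserts new rows instead of updating old ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical append-only table: written_at records when each row was loaded.
conn.execute("CREATE TABLE prices (sku TEXT, price REAL, written_at TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("a", 10.0, "2024-01-01T00:00:00"),
        ("b", 20.0, "2024-01-01T00:00:00"),
        ("a", 12.0, "2024-01-02T00:00:00"),
        ("a", 15.0, "2024-01-03T00:00:00"),
    ],
)


def as_of(conn: sqlite3.Connection, cutoff: str) -> dict:
    """Return {sku: price} as the table looked at `cutoff` (ISO 8601 string)."""
    query = """
        SELECT p.sku, p.price
        FROM prices AS p
        JOIN (
            SELECT sku, MAX(written_at) AS latest
            FROM prices
            WHERE written_at <= ?
            GROUP BY sku
        ) AS m ON m.sku = p.sku AND m.latest = p.written_at
    """
    return dict(conn.execute(query, (cutoff,)).fetchall())


snapshot = as_of(conn, "2024-01-02T12:00:00")  # view after the second ETL run
latest = as_of(conn, "2024-01-03T00:00:00")    # view after the third ETL run
```

The ISO 8601 strings sort lexicographically in time order, which is what lets the plain `<=` and `MAX` comparisons work here. Handling deletes would need an extra tombstone column, which this sketch leaves out.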
But these are infrastructure-level approaches. I'm not sure that it's a problem for a version control tool.