by wenc
2250 days ago
The utility of version controlling production-sized data (not sample training data, and as opposed to code) is something I'm having trouble grasping, unless I'm missing something here -- and I may be, so please enlighten me. It seems to me that to time-travel in data you almost need to store the write-ahead log of database transactions and be able to replay it. Debezium captures the CDC (change data capture) information, but it's an infrastructure-level tool rather than a version control tool. In data science, most time-travel issues are worked around using bitemporal data modeling, which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day. But these are infrastructure-level approaches. I'm not sure that it's a problem for a version control tool.
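To make the timestamp-column idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `prices`, the column `written_at`, and the helper functions are all made up for illustration; a real ETL setup would differ, and this only covers the "when was it written" axis rather than full bitemporal modeling.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical append-only table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (item TEXT, price REAL, written_at TEXT)")
    return conn

def write_price(conn, item, price, written_at):
    # Never UPDATE in place: every ETL run appends a new row with its write time.
    conn.execute("INSERT INTO prices VALUES (?, ?, ?)", (item, price, written_at))

def price_as_of(conn, item, as_of):
    # "Roll back" by reading the latest row written at or before as_of.
    # ISO 8601 strings in the same format sort chronologically as plain text.
    row = conn.execute(
        "SELECT price FROM prices WHERE item = ? AND written_at <= ? "
        "ORDER BY written_at DESC LIMIT 1",
        (item, as_of),
    ).fetchone()
    return row[0] if row else None

conn = make_db()
write_price(conn, "widget", 10.0, "2020-03-01T00:00:00")
write_price(conn, "widget", 12.5, "2020-03-02T00:00:00")

# State as the March 1 run left it, even though newer data exists.
print(price_as_of(conn, "widget", "2020-03-01T12:00:00"))
```

The trade-off is that the table only grows, so you pay in storage and need an index on the timestamp column to keep "as of" reads fast.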
https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...