
by wenc 2250 days ago

The utility of version-controlling production-sized data (as opposed to code, or to small sample training data) is something I'm having trouble grasping. I may be missing something here, so please enlighten me.

It seems to me that to time-travel in data you almost need to store the write-ahead log of database transactions and be able to replay it. Debezium captures the CDC information, but it's an infrastructure-level tool rather than a version control tool.

In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.
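A minimal sketch of the timestamp-column approach described above, using SQLite so it runs anywhere. The table `prices`, its columns, and the helper `as_of` are made up for illustration, and it assumes the ETL only ever appends rows rather than updating in place:

```python
import sqlite3


def as_of(conn, cutoff):
    """Return the latest version of each sku written at or before `cutoff`."""
    return conn.execute(
        """
        SELECT p.sku, p.price
        FROM prices AS p
        JOIN (
            SELECT sku, MAX(written_at) AS written_at
            FROM prices
            WHERE written_at <= ?
            GROUP BY sku
        ) AS latest
          ON latest.sku = p.sku AND latest.written_at = p.written_at
        ORDER BY p.sku
        """,
        (cutoff,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# written_at is the extra column recording when each row was written.
conn.execute("CREATE TABLE prices (sku TEXT, price REAL, written_at TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("a", 10.0, "2020-04-01"),
        ("b", 20.0, "2020-04-01"),
        ("a", 12.0, "2020-04-02"),  # the next ETL run writes a new version of "a"
    ],
)

print(as_of(conn, "2020-04-01"))  # state as of the first ETL run
print(as_of(conn, "2020-04-02"))  # state after the second ETL run
```

Rolling back to an ETL point is then just a matter of choosing the cutoff. An index on `(sku, written_at)` is what would keep this fast on a large table.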

But these are infrastructure-level approaches. I'm not sure that it's a problem for a version control tool.

3 comments

timsehn 2250 days ago

Tim, CEO of Liquidata, the company that built Dolt and DoltHub, here. This is how we store the version-controlled rows so that we get structural sharing across versions (i.e. 50M rows plus a one-row change becomes 50M+1 entries in the database, not 100M, with no need to replay logs):

https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...
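To make "structural sharing" concrete, here is a toy sketch. It is not Dolt's storage format (the linked post describes that); it only shows the general idea of content-addressed chunks, where two versions that differ in one row share every chunk except the one that changed. `ChunkStore`, `commit`, and the fixed chunk size are all assumptions made for the example:

```python
import hashlib
import json


class ChunkStore:
    """Content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}

    def put(self, rows):
        data = json.dumps(rows).encode()
        key = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(key, rows)
        return key


def commit(store, table, chunk_size=4):
    """Split the key-sorted rows into chunks and return the chunk hashes."""
    rows = sorted(table.items())
    return [
        store.put(rows[i:i + chunk_size])
        for i in range(0, len(rows), chunk_size)
    ]


store = ChunkStore()
v1 = {i: f"row-{i}" for i in range(16)}
v2 = dict(v1)
v2[5] = "row-5-edited"  # a one-row change

c1 = commit(store, v1)
c2 = commit(store, v2)
print(len(set(c1) & set(c2)), "chunks shared;", len(store.chunks), "stored in total")
```

Fixed-size chunks are a simplification: inserting a row would shift every later chunk boundary and destroy the sharing, which is why real systems pick boundaries based on content instead.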


wenc 2250 days ago

Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.

Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?


timsehn 2250 days ago

You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.

Right now, Dolt can't easily be distributed (i.e. data must fit on one hard drive), so it's not meant for big data; it's more for data that humans interact with, like mapping tables or daily summary tables. But long term, if we can get some traction, we plan on building "big dolt", a distributed version that can scale to as big as you want.


wenc 2250 days ago

Ah now I understand!

So for most analytic workloads, typically a columnstore db is used due to the need for performance and advanced SQL features (windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. Which means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).

Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.


seddonm1 2250 days ago

If you are working within the Apache Spark ecosystem you can use Delta Lake (https://delta.io/) to create 'merge' datasets which are transactional, versioned and allow time travel by both version number and timestamp.


jamesblonde 2250 days ago

Another alternative to Delta Lake is Apache Hudi, which also includes bloom filters for indexing time-travel queries (to efficiently exclude any files given the supplied time constraint). Z-ordered indexing is not yet available in open-source Delta Lake, only in the Databricks version.


zachmu 2250 days ago

One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:

`SELECT * FROM dolt_diff_$table where from_commit = '230sadfo98' and to_commit = 'sadf9807sdf'`
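For readers who want to see what "only the rows that changed" means, here is a naive sketch that compares two snapshots keyed by primary key. It scans both snapshots in full, which is exactly the cost a storage-level diff avoids, so treat it as an illustration of the result and not of how Dolt computes it. The function `diff_rows` and the output field names are invented for the example:

```python
def diff_rows(from_rows, to_rows):
    """Compare two snapshots (dicts keyed by primary key); keep changed rows only."""
    changes = []
    for key in sorted(from_rows.keys() | to_rows.keys()):
        before, after = from_rows.get(key), to_rows.get(key)
        if before == after:
            continue  # unchanged rows never appear in the diff
        if before is None:
            kind = "added"
        elif after is None:
            kind = "removed"
        else:
            kind = "modified"
        changes.append({"pk": key, "diff_type": kind, "from": before, "to": after})
    return changes


from_commit = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
to_commit = {1: {"name": "a"}, 2: {"name": "B"}, 4: {"name": "d"}}

for change in diff_rows(from_commit, to_commit):
    print(change)
```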


jacques_chester 2250 days ago

> In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written".

Not quite, this is "transaction time". You also need "valid time" to be truly bitemporal. Recovering the database as of some point in time is not enough to answer questions like "when will this fact become false?" or "when did our belief about when it would become false change?", because you didn't preserve assertions about the time range over which the fact was held to be true.

In terms of implementations, ranges are better than double timestamps. They provide their own assertion of monotonicity and can be easily used in exclusion indices.
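A small sketch of the distinction, assuming half-open `[from, to)` ranges for both time dimensions. The `Fact` record, the `lookup` helper, and the account example are hypothetical; a real database would enforce non-overlap with a constraint instead of trusting the data:

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # open-ended upper bound


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: date  # valid time: when the fact holds in the real world
    valid_to: date    # exclusive
    tx_from: date     # transaction time: when the database held this belief
    tx_to: date       # exclusive


def lookup(facts, key, valid_at, known_at):
    """What did we believe at `known_at` about `key` as of `valid_at`?"""
    for f in facts:
        if (f.key == key
                and f.valid_from <= valid_at < f.valid_to
                and f.tx_from <= known_at < f.tx_to):
            return f.value
    return None


facts = [
    # Recorded on Jan 1: the account is on "basic" indefinitely.
    Fact("acct", "basic", date(2020, 1, 1), FOREVER, date(2020, 1, 1), date(2020, 3, 1)),
    # Recorded on Mar 1: "basic" will end on Jun 1, when "pro" takes over.
    Fact("acct", "basic", date(2020, 1, 1), date(2020, 6, 1), date(2020, 3, 1), FOREVER),
    Fact("acct", "pro", date(2020, 6, 1), FOREVER, date(2020, 3, 1), FOREVER),
]

# Same real-world date, two different answers depending on when we ask.
print(lookup(facts, "acct", date(2020, 7, 1), known_at=date(2020, 2, 1)))
print(lookup(facts, "acct", date(2020, 7, 1), known_at=date(2020, 4, 1)))
```

With only a written-at timestamp, the second and third rows could not express that the change was known in March but takes effect in June.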

I found that Snodgrass's textbook was a good introduction to the concepts and it's available for free: https://www2.cs.arizona.edu/~rts/tdbbook.pdf


wenc 2250 days ago

Yes, you're correct -- an omission on my part. You need "valid time" (otherwise it's just "uni"-temporal modeling).

Thank you for the link to Snodgrass' book. I've not seen a formal book on temporal modeling in SQL before, so this is fascinating.


jacques_chester 2250 days ago

Glad I could help! The research seems to have puttered on for a while after this book was written, but appears to have fizzled out by around the turn of the millennium.

Some notion of bitemporalism showed up in SQL 2011, but somewhat constrained compared to what Snodgrass describes.


sgt101 2250 days ago

I worry about retraining every day. Isn't that a flag that says "It hasn't learned a thing and actually I'm just improving my backfitting score"?


wenc 2250 days ago

Not really -- in many forecasting applications in fast-changing markets, it is fairly common to dynamically retrain your recursive model to a moving window of historical data in order to adapt to your current environment (with some regularization). The length of the window depends on how fast the market changes.

For these types of recursive model applications, you cannot just fit the model once and forget about it.
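A rough sketch of what moving-window retraining looks like in code. The AR(1) model here is only a stand-in for whatever recursive model is actually used, the function names are made up, and the regularization mentioned above is left out to keep it short:

```python
def fit_ar1(window):
    """Least-squares fit of x[t] = a * x[t-1] + b over one window."""
    xs, ys = window[:-1], window[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # flat window: fall back to predicting the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def rolling_forecasts(series, window_size):
    """Refit on the most recent `window_size` points before each forecast."""
    forecasts = []
    for t in range(window_size, len(series)):
        a, b = fit_ar1(series[t - window_size:t])
        forecasts.append(a * series[t - 1] + b)
    return forecasts


series = [1, 3, 7, 15, 31, 63]  # follows x[t] = 2 * x[t-1] + 1 exactly
print(rolling_forecasts(series, window_size=3))
```

Each forecast only sees the last `window_size` observations, so a shorter window adapts faster to a changing market at the cost of noisier parameter estimates.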


somurzakov 2250 days ago

As long as it works well on out-of-sample data at deployment time, it is okay.

Until some major data drift happens, but you would notice it anyway.


sgt101 2249 days ago

Honestly, I've heard people in Vegas tell me the same about their strategies vs. slots. Genuinely, if you have made money from this - well done, take it out now, congratulate yourself. If you haven't...
