Hacker News new | ask | show | jobs
by timsehn 2250 days ago
Tim , CEO of Liquidata, the company that built Dolt and DoltHub here. This is how we store the version controlled rows so that we get structural sharing across versions (ie. 50M + one row chgange becomes 50M+1 entries in the database not 100M with no need to replay logs):

https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...

1 comments

Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.

Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?

You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.

Right now, Dolt can't be distributed (ie. data must fit on one hard drive) easily so it's not meant for big data, more data that humans interact with, like mapping tables or daily summary tables. But, long term if we can get some traction, we plan on building "big dolt" which would be a distributed version that can scale to as big as you want.

Ah now I understand!

So for most analytic workloads, typically a columnstore db is used due to the need for performance and advanced SQL features (windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. Which means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).

Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.

If you are working within the Apache Spark ecosystem you can us DeltaLake https://delta.io/ to create 'merge' datasets which are transactional, versioned and allow time travel by both version number and timestamp.
Another alternative to Deltalake is Apache Hudi, which also includes bloom filters for indexing time-travel queries (efficiently exclude any files given the supplied time constraint). Z-ordered indexing in Deltalake is not available yet in open-source deltalake, only in Databricks version.
One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:

`SELECT * FROM dolt_diff_$table where from_commit = '230sadfo98' and to_commit = 'sadf9807sdf'`