by eliomattia 1149 days ago
Dolt came up when I searched for "git for data". It seems great, though I have never used it. I know it works on prolly trees rather than on top of git. I am really curious about that choice: why exactly not on git? How can you offer data removal from history without rebuilding repos? Especially here in the EU, there is an ongoing conversation about structured data documentation, collaboration, and tech design choices that affect privacy.

We wanted a solution that scales up to terabytes and has fast database access. Storing giant CSV files in git natively doesn't scale, and if you store them in S3 you lose the fast diff and all merge capability.

As near as we can figure, GDPR does require rebasing in some cases. There are creative ways to avoid taking an outage during this rebase, or other schemes such as storing all PII in non-versioned tables. We haven't built any of that yet, though; nobody has asked for GDPR support.

Important points. I aim for version control for data repositories with HDD efficiency, visualization of diffs for collaboration, and API access to individual datasets at multiple identifiable versions from git. Datasets can then be streamed into a database individually. Migrating by processing diffs from commits is a future possibility. Fast, direct, database-like repository access with in-situ editing is not my initial goal. Rather, I want a clonable Git repository that is separate from the database working areas, can be connected to MySQL using I/O pipelines, and can easily export dataset versions as labeled snapshots.

On diffs: the diffing workload is lazy and needed just once, when a commit is requested for a repo. The bidirectional diff results (including index and column changes), not the newly changed files, are then committed as CSV or pointer objects, which git natively sees as new objects. I had to write an engine to rebuild and serve snapshots from git history; it now runs on localhost with posting to S3, and later it will also run in the cloud. Diffing a dataset (CSV, Excel, SQL) against the currently checked-out version, which resides in a gitignored "datasets" folder in the working directory, currently takes ~20 seconds on a 1 GB CSV dataset with 10M rows. Diffing is not always needed: it can be bypassed in incremental workflows (new data daily) by committing just the diff. I can handle repos of tens of GBs, with individual datasets of GBs each. Where to put the compute workload of diffing, checking out, and building snapshots is under careful consideration.
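
To make the idea of a bidirectional diff concrete, here is a minimal Python sketch. It is not the actual engine: the function names (load_csv, diff_rows, apply_diff), the "id" key column, and the toy data are all assumptions for illustration, and it ignores column changes entirely. The point is only that a diff storing both old and new values can be replayed forward or backward to rebuild a snapshot.

```python
import csv
import io


def load_csv(text, key):
    """Parse CSV text into {key value: row dict} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Row-level diff that keeps both sides, so it can be applied either way."""
    return {
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "changed": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


def apply_diff(rows, diff, reverse=False):
    """Replay a diff forward (old -> new) or backward (new -> old)."""
    out = dict(rows)
    removed, added = diff["removed"], diff["added"]
    if reverse:
        # Going backward, former additions are dropped and removals restored.
        removed, added = added, removed
    for k in removed:
        out.pop(k, None)
    out.update(added)
    for k, (before, after) in diff["changed"].items():
        out[k] = before if reverse else after
    return out


# Toy datasets, assumed to be keyed by an "id" column.
OLD = "id,name,qty\n1,apple,3\n2,pear,5\n3,plum,7\n"
NEW = "id,name,qty\n1,apple,3\n2,pear,6\n4,fig,2\n"

old_rows = load_csv(OLD, "id")
new_rows = load_csv(NEW, "id")
d = diff_rows(old_rows, new_rows)

# Only the diff needs to be stored to move between the two snapshots.
rebuilt_new = apply_diff(old_rows, d)
rebuilt_old = apply_diff(new_rows, d, reverse=True)
```

In the incremental case (new data daily), a dict shaped like d could be written directly and committed without running diff_rows at all.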

Merge will be assisted, with files in S3 or without: a 3-way comparison of commits boils down to reusing features from the snapshot rebuild engine, starting from the common ancestor, using only the diffs, and handling conflicts. Merging small changes on large datasets involves dealing with the small changes only. Data and diffs can come from S3 if they are not already on localhost, so that the merge operates on data, not pointers. Visually presenting diffs between non-adjacent commits requires a UI, since current git tools would not interpret the history correctly.
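
As a rough illustration of a 3-way merge that works only on diffs from the common ancestor, here is a small Python sketch. Again this is an assumption-laden toy rather than the real engine: diff, three_way_merge, and the row layout are hypothetical names, rows are plain dicts keyed by an id, and a conflict is simply any key that both sides changed differently.

```python
def diff(base, other):
    """Map each changed key to its new row, or None if the row was deleted."""
    changes = {}
    for k in base.keys() | other.keys():
        if base.get(k) != other.get(k):
            changes[k] = other.get(k)
    return changes


def three_way_merge(base, ours, theirs):
    """Merge two descendants of base by looking only at their diffs."""
    d_ours = diff(base, ours)
    d_theirs = diff(base, theirs)

    # A key is in conflict when both sides changed it to different results.
    conflicts = {
        k: (d_ours[k], d_theirs[k])
        for k in d_ours.keys() & d_theirs.keys()
        if d_ours[k] != d_theirs[k]
    }

    merged = dict(base)
    for changes in (d_ours, d_theirs):
        for k, row in changes.items():
            if k in conflicts:
                continue  # left at the base value for manual resolution
            if row is None:
                merged.pop(k, None)
            else:
                merged[k] = row
    return merged, conflicts


# Toy data: rows keyed by id.
base = {"1": {"qty": "3"}, "2": {"qty": "5"}, "3": {"qty": "7"}}
ours = {"1": {"qty": "4"}, "2": {"qty": "6"}, "3": {"qty": "7"}}
theirs = {"1": {"qty": "8"}, "2": {"qty": "5"}, "4": {"qty": "2"}}

merged, conflicts = three_way_merge(base, ours, theirs)
```

The work done after the diffs exist is proportional to the number of changed rows, not to the size of the dataset, which is the property that makes small merges on large datasets cheap.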