by eliomattia 1148 days ago
Important points. I am aiming for version control for data repositories with disk efficiency, visualization of diffs for collaboration, and API access to individual datasets at multiple identifiable versions from Git. Datasets can then be streamed into a database individually. Migrating by processing diffs from commits is a future possibility. Fast, direct, database-like repository access with in-situ editing is not my initial goal. Rather, I want a clonable Git repository that is separate from the database working areas, can be connected to MySQL using I/O pipelines, and can easily export dataset versions as labeled snapshots.

On diffs: the diffing workload is lazy and is needed just once, when a commit is requested for a repo. The bidirectional diff results (including index and column changes), not the newly changed files, are then committed as CSV or pointer objects; Git natively sees such objects as new. I had to write an engine to rebuild and serve snapshots from Git history, currently on localhost with posting to S3, later also in the cloud. Diffing a dataset (CSV, Excel, SQL) against the currently checked-out version, which resides in a gitignored "datasets" folder in the working directory, currently takes ~20 seconds on a 1 GB CSV dataset with 10M rows. Diffing is not always needed: it can be bypassed in incremental workflows (new data daily) by committing just the diff. I can handle repos of tens of GBs, with individual datasets of multiple GBs each. Where to put the compute workload of diffing, checking out, and building snapshots is under careful consideration.
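
To make "bidirectional diff" concrete, here is a minimal sketch in plain Python. It is not my actual engine: the function names (`read_rows`, `diff_rows`, `apply_diff`) and the in-memory dict layout are assumptions for illustration, and it only tracks row-level changes keyed on one index column. A column change simply shows up as changed rows. Because the diff stores both the old and the new values, the same object can move a snapshot forward or backward in history.

```python
import csv
import io


def read_rows(text, key):
    """Parse CSV text into {index value: row dict} (hypothetical helper)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def diff_rows(old, new):
    """Row-level diff that keeps both sides, so it can be applied either way."""
    removed = {k: old[k] for k in old.keys() - new.keys()}
    added = {k: new[k] for k in new.keys() - old.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"removed": removed, "added": added, "changed": changed}


def apply_diff(rows, diff, reverse=False):
    """Rebuild the next snapshot (or the previous one when reverse=True)."""
    out = dict(rows)
    gone, fresh = ("added", "removed") if reverse else ("removed", "added")
    for k in diff[gone]:
        out.pop(k, None)
    out.update(diff[fresh])
    for k, (before, after) in diff["changed"].items():
        out[k] = before if reverse else after
    return out


v1 = read_rows("id,price\n1,10\n2,20\n3,30\n", "id")
v2 = read_rows("id,price\n1,10\n2,25\n4,40\n", "id")
d = diff_rows(v1, v2)
# Only `d` would need to be committed; either version can be rebuilt from the other.
assert apply_diff(v1, d) == v2
assert apply_diff(v2, d, reverse=True) == v1
```

A real 10M-row dataset would not be held in Python dicts like this; the sketch only shows the shape of the data that gets committed.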

Merge will be assisted, with or without files in S3: a 3-way comparison of commits boils down to reusing features of the snapshot rebuild engine, starting from the common ancestor and using only the diffs, and then handling conflicts. Merging small changes on large datasets involves dealing with the small changes only. Data and diffs can be fetched from S3 if not already on localhost, so that the merge operates on data, not pointers. Visually presenting diffs between non-adjacent commits requires a dedicated UI, since current Git tools would not interpret the history correctly.
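
A rough sketch of why merging only touches the small changes, again in plain Python and again not the real implementation. It assumes each branch's diff against the common ancestor has been reduced to a hypothetical `{key: new_row_or_None}` mapping, where `None` means the row was deleted. The names `merge_diffs` and `apply_changes` are made up for this example.

```python
def merge_diffs(ours, theirs):
    """Combine two diffs taken from the same ancestor; collect conflicts."""
    merged, conflicts = {}, {}
    for key in ours.keys() | theirs.keys():
        if key in ours and key in theirs and ours[key] != theirs[key]:
            # Both sides touched the same row in different ways.
            conflicts[key] = (ours[key], theirs[key])
        else:
            merged[key] = ours[key] if key in ours else theirs[key]
    return merged, conflicts


def apply_changes(base, changes):
    """Apply a merged diff to the ancestor snapshot; None deletes the row."""
    out = dict(base)
    for key, row in changes.items():
        if row is None:
            out.pop(key, None)
        else:
            out[key] = row
    return out


base = {"1": {"price": "10"}, "2": {"price": "20"}, "3": {"price": "30"}}
ours = {"2": {"price": "25"}, "3": None}
theirs = {"2": {"price": "22"}, "3": None, "4": {"price": "40"}}

merged, conflicts = merge_diffs(ours, theirs)
# Row "2" conflicts and is left for assisted resolution; the rest merges cleanly.
result = apply_changes(base, merged)
```

The work is proportional to the number of touched rows, not to the size of the dataset, and the ancestor snapshot is only needed at the end to materialize the result.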