by rektide 1507 days ago
I don't have a whole lot to say on this right now (very WIP), but I have a strong belief that git is a core tool we should be using for data.

Most data formats are thick formats that pack everything into a single file. Part of the effort in switching to git would be a shift toward unpacking our data, to really make use of the file system to store fine-grained pieces of data.
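
As a rough illustration of what "unpacking" could look like, here is a minimal Python sketch that explodes a nested record into one small file per leaf value and reads it back. The `unpack` and `pack` helpers and the one-JSON-value-per-file layout are assumptions made for this example, not something Irmin or git prescribes; it also assumes keys are safe to use as file names.

```python
import json
from pathlib import Path


def unpack(data: dict, root: Path) -> None:
    """Write each leaf value of a nested dict to its own file under root."""
    for key, value in data.items():
        target = root / str(key)  # assumes keys are valid file names
        if isinstance(value, dict):
            # Nested dicts become directories.
            target.mkdir(parents=True, exist_ok=True)
            unpack(value, target)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            # One small JSON file per leaf, so changing one field touches one file.
            target.write_text(json.dumps(value) + "\n")


def pack(root: Path) -> dict:
    """Rebuild the nested dict from the directory tree written by unpack."""
    result = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            result[entry.name] = pack(entry)
        else:
            result[entry.name] = json.loads(entry.read_text())
    return result
```

With a layout like this, a change to a single field shows up as a one-file diff, and ordinary shell tools can read or script against individual values.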

It's been around for a while, but Irmin[1] (written in OCaml) is a decent-enough almost-example of these kinds of practices. 9p lacks the version control aspect, but it is certainly another inspiration, as it encouraged the state of all things to be held & stored in fine-grained files. Git, I think, is a superpower, but just as much: having data which can be scripted, which speaks the lingua franca of computing, is a superpower too.

[1] https://irmin.org/ https://news.ycombinator.com/item?id=8053687 (147 points, 8 years ago, 25 comments)

2 comments

You really want to use CRDTs, not data types subject to human resolved merge conflicts.
I feel like CRDTs are sold as a panacea. I can easily imagine users making conflicting changes, so I don't really see or understand what the real value or weaknesses of CRDTs are.

I'm also used to seeing them used for online synchronization, & far fewer examples of distributed CRDTs, which is, to me, highly important.

Git by contrast has straightforward & good merge strategies. At this point, I feel like the problems are complex & that we need complex tools that leave users & devs in charge & steering. I'm so ready to be wrong, but I don't feel like these problems are outsmartable; CRDTs have always felt like they try to define a too-limited world. For now, I feel like tools for managing files between different filesystems are more complex, but they are the minimum level of possibility we need.

Whether conflict-resolution can be performed automatically or may require manual input is important at scale.

Human editors may cause the content within a CRDT datastructure to become inconsistent in the sense of "is this document understandable by another person", but they can't cause conflicts that block the editing process on-disk.
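
To make that concrete, here is a minimal sketch of one simple CRDT, a last-writer-wins map, in Python. The `Entry` type, the logical `timestamp`, and the `replica` tie-breaker are assumptions chosen for illustration; real CRDT libraries use richer types. The merge never fails or asks for input, but when two replicas edit the same key concurrently, one edit is silently dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    value: str
    timestamp: int  # logical clock (assumed for this sketch)
    replica: str    # replica id, used only to break timestamp ties


def merge(a: dict[str, Entry], b: dict[str, Entry]) -> dict[str, Entry]:
    """Merge two replicas; for each key the entry with the larger
    (timestamp, replica) pair wins, so the result is order-independent."""
    merged = dict(a)
    for key, entry in b.items():
        current = merged.get(key)
        if current is None or (entry.timestamp, entry.replica) > (
            current.timestamp,
            current.replica,
        ):
            merged[key] = entry
    return merged


alice = {"title": Entry("Draft A", 2, "alice")}
bob = {"title": Entry("Draft B", 2, "bob"), "body": Entry("hello", 1, "bob")}

# Both merge orders converge on the same state, with no conflict to resolve.
assert merge(alice, bob) == merge(bob, alice)
```

In this run the concurrent edits to `title` resolve to bob's value purely because of the tie-breaker, which is the "consistent on disk, not necessarily what a person meant" trade-off.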

On the other hand, git merges can -- and frequently do -- involve conflict resolution that isn't effectively handled -- especially in a distributed system -- by automated measures.
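
For contrast, here is a simplified three-way merge over key/value records in Python. It is only a model of the idea: git itself merges file contents line by line, and the `three_way_merge` function and its use of `None` for "key absent" are assumptions of this sketch. The point is that when both sides change the same key differently, the merge cannot decide and hands the key back for someone to resolve.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Return (merged, conflicts). Keys changed differently on both
    sides are left out of merged and listed in conflicts."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        # None stands for "key absent" (assumes None is never a stored value).
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree, or both deleted the key
        elif o == b:
            value = t  # only theirs changed
        elif t == b:
            value = o  # only ours changed
        else:
            conflicts.append(key)  # both changed, differently
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"title": "Draft", "body": "hello"}
ours = {"title": "Draft A", "body": "hello"}
theirs = {"title": "Draft B", "body": "hello world"}

merged, conflicts = three_way_merge(base, ours, theirs)
# merged holds only the cleanly merged key; "title" is left for manual resolution.
```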

> I have a strong belief that git is a core tool we should be using for data

It isn't, we shouldn't, and you're not the first and won't be the last person to put time into this. It's neither a compelling solution nor even a particularly good one.

Gee thanks great critique much appreciated. +1