by ezekiel68 439 days ago
>And the git data structure... falls apart for large files.

I'm good with this. In my over 25 years of professional experience, having used cvs, svn, perforce, and git, it's almost always a mistake to keep non-source files in the VCS. Digital assets and giant data files are nearly always better off being served from artifact repositories or CDN systems (including in-house flavors of these). I've worked at EA Sports and Rockstar Games, and the number of times dev teams went backwards in versions with digital assets can be counted on the fingers of a single hand.

3 comments

Are CAD data not sources in and of themselves?

My last CAD file was 40GiB, and that wasn't a large one.

The idea that all sources are text means that art is never a source, and that many engineering disciplines are excluded.

There's a reason Perforce dominates in games and automotive, and it's not because people love Perforce.

I think this conflates "non-source" with "large". Yes, it's often the case that source files are smaller than generated output files (especially for graphics artifacts), but this is really just a lucky coincidence that prevents the awkwardness of dealing with large files in version control from becoming as much of a hassle as it might be. Having a VCS that dealt with large files comfortably would free our minds and open up new vistas.

I think the key issue is actually how to sensibly diff and merge these other formats. Levenshtein-distance-based diffing is good enough for many text-based formats (like typical program code), but there is scope for so much better. Perhaps progress will come from designing file formats (including binary formats) specifically with "diffability" in mind -- similar to the way that, say, Java was designed with IDE support in mind.
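For anyone who hasn't looked at what "Levenshtein-distance-based" means when applied to lines rather than characters, here is a minimal sketch. The function name `edit_distance` and the sample data are made up for illustration, and real diff tools generally use more refined algorithms than this textbook version.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of lines)."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


old = ["a = 1", "b = 2", "c = 3"]
new = ["a = 1", "b = 20", "c = 3", "d = 4"]
print(edit_distance(old, new))  # 2: one changed line, one added line
```

Note that the comparison `x == y` treats each line as opaque, which is exactly why this approach has nothing useful to say about binary or structurally reordered formats.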

Non-source files should indeed never be in the VCS, but source files can still be binary, or large, or both. It depends on how you are editing the source and building the source into non-source files.
Also, some source files that could otherwise be treated as text⁰ end up effectively being binary blobs because tools don't write them in a stable order, which makes tracking small changes difficult because you can't see that they actually are small changes. A number of XML formats¹, and sometimes JSON & others, have this issue too.

----

[0] for the purposes of change tracking and merging

[1] Stares aggressively at SSIS for its nasty package file format² and its habit of saving parts of it in different orders, apparently at random, so that updating the text of an annotation can completely rearrange the saved file

[2] far from the only crime committed by SSIS, I know, but one occasionally irritating enough to mention
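To make the unstable-order problem concrete, here is a small sketch using JSON rather than a real SSIS package. The data, the `changed_lines` helper and the field names are all invented for illustration. One annotation changes, but the "tool" also writes the keys in a different order.

```python
import difflib
import json


def changed_lines(old_text, new_text):
    """Count the added and removed lines in a unified diff of two texts."""
    diff = list(difflib.unified_diff(old_text.splitlines(),
                                     new_text.splitlines(),
                                     lineterm="", n=0))
    # The first two lines are the ---/+++ file headers; skip them.
    return sum(1 for line in diff[2:] if line.startswith(("+", "-")))


# Hypothetical document before and after a one-field edit.
before = {"name": "pkg", "annotation": "old note", "tasks": ["load", "transform"]}
after = {"tasks": ["load", "transform"], "annotation": "new note", "name": "pkg"}

# As saved: key order differs between the two versions.
raw = changed_lines(json.dumps(before, indent=2), json.dumps(after, indent=2))

# Saved with a stable key order: only the real change shows up.
stable = changed_lines(json.dumps(before, indent=2, sort_keys=True),
                       json.dumps(after, indent=2, sort_keys=True))

print(raw, stable)
```

With a stable order the diff is one removed line and one added line. As saved, the same edit touches several times as many lines, and the gap grows with the size of the file.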

Could you use git pre-commit hooks or something similar to transform the files by deterministically sorting the items at each level?

Diffoscope does something similar: it diffs the sorted content first, and if that shows no changes, it reports that and then shows the unsorted diffs.

https://diffoscope.org/ https://try.diffoscope.org/
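A rough sketch of what such a normalizing step could look like for a generic XML file, using only the Python standard library. It assumes element and attribute order carry no meaning in the format, which may well not hold for SSIS packages, and it drops whitespace-only text. The names `canonicalize` and `normalize_xml` are made up here, and wiring the script into an actual git hook (including re-staging the rewritten files) is left out.

```python
import sys
import xml.etree.ElementTree as ET


def canonicalize(elem):
    """Recursively put attributes and child elements into a stable order."""
    # ElementTree writes attributes in insertion order (Python 3.8+), so re-insert them sorted.
    attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attrs)
    for child in elem:
        canonicalize(child)
        if child.tail is not None and not child.tail.strip():
            child.tail = None  # drop indentation-only whitespace
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    # Children are already canonical, so their serialized form is a stable sort key.
    elem[:] = sorted(elem, key=ET.tostring)
    return elem


def normalize_xml(text):
    """Return the XML in `text` with sorted attributes/children and fresh indentation."""
    root = canonicalize(ET.fromstring(text))
    ET.indent(root)  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode") + "\n"


if __name__ == "__main__":
    # Rewrite each file named on the command line in place.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            normalized = normalize_xml(f.read())
        with open(path, "w", encoding="utf-8") as f:
            f.write(normalized)
```

Two files that differ only in sibling order, attribute order or indentation come out byte-identical, so the diff then shows only real changes. Whether the tool that owns the format still opens the rewritten file is a separate question that would need testing.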

> Could you use git pre-commit hooks

Possibly, though I might be concerned that the format has ordering oddities that it is unexpectedly sensitive to. Unlikely, but given how many other oddities DTS/SSIS has collected over the years I'd not be surprised!

Also, we weren't using Git in DayJob at the time we were actively developing with SSIS (maybe VSTS had an equivalent we could have used?), and we are now acting to remove the last vestiges of it from our workflows rather than spending time making it work better with them!

OMG! Please don't remind me about trying to source control SSIS. One tiny change cascades into 1000 lines of source being different. Total nightmare.