by dognotdog 996 days ago
I keep coming back to fossil again and again, despite git having a huge pull because of the easy publishing and collab on github/gitlab.

Just the other day I was starting an exploratory project, and thought: I'll just use git so I can throw this on github later. Well, silly me, it happened to contain some large binary files, and github rejected it, wanting me to use git-lfs for the big files. After half an hour of not getting it to work, I just thought screw it, I'll drop everything into fossil, and that was it. I have my issue tracker and wiki and everything, though admittedly I'll have some friction later on if I want to share this project. Not having to deal with random git-lfs errors later on when trying to merge commits with these large files is a plus, and if I ever want to, I can fast-export the repo and ingest it into git.


It is extremely rare that I have a file over 100MB.

I also think it’s one of those situations where if I have a giant binary file in source control “I’m doing it wrong” so git helps me design better.

It’s like in the olden days when you couldn’t put blobs directly in a row so databases made you do your file management yourself instead of just plopping in files.

I like git. I don’t like giant binary files in my commit history. It’s cool that you like fossil, but I don’t see this as a reason for me to use it.

You didn't put blobs directly in the database because of annoying database limitations, not because there's a fundamental reason not to.

It's the same with Git. Don't put large files directly in Git because Git doesn't support that very well, not because it's fundamentally the wrong thing to do.

There should be a name for this common type of confusion: Don't mistake universal workarounds for desirable behaviour.

The fundamental reason had to do with how an RDBMS structured its pages of data: having arbitrarily sized blobs directly in the record broke the storage optimization and made performance tank.

It was a design constraint back in the day.

I haven’t looked at this in decades, but I think now it’s all just pointers to the file system and not actually bytes in the record.

So it was fundamentally the wrong thing to do based on how databases stored data for performant recall.

But that’s back when disks were expensive and distributed nodes were kind of hard.

> I think now it’s all just pointers to the file system

It depends. InnoDB, assuming the DYNAMIC row format, will store TEXT/BLOB on-page up until 40 bytes, at which point it gets sent off-page with a 20 byte pointer on-page. However, it comes with a potentially severe trade-off before MySQL 8.0.13: any queries with those columns that would generate a temporary table (CTEs, GROUP BY with a different ORDER BY predicate, most UNIONs, many more) can’t use in-memory temp tables and instead go to disk. Even after 8.0.13, if the size of the temp table exceeds a setting (default of 16 MiB), it spills to disk.

tl;dr - be very careful with MySQL if storing TEXT or BLOB, and don’t involve those columns in queries unless necessary.

Postgres, in comparison, uses BYTEA as a normal column that gets TOASTed (sent off-page in chunks) after a certain point (I think 2 KiB?), so while you might need to tune the column storage strategy for compression - depending on what you’re storing - it might be fine. There are various size limits (1 GiB?) and row count limits for TOAST, though. The other option is the Large Object facility, which requires its own interface for storage and retrieval, but avoids most of the limitations mentioned.

Or, you know, chuck binary objects into object storage and store a pointer or URI in the DB.
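
A minimal sketch of that last option, assuming a local directory stands in for the object store and SQLite stands in for the database. The `assets` table, the `put_blob` and `register_asset` helpers, and the `file://` URIs are illustrative choices, not anything named in the thread:

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def put_blob(store: Path, data: bytes) -> str:
    """Write data under its SHA-256 digest and return a URI pointing at it."""
    digest = hashlib.sha256(data).hexdigest()
    path = store / digest
    if not path.exists():  # identical content is stored only once
        path.write_bytes(data)
    return path.as_uri()


def register_asset(db: sqlite3.Connection, name: str, uri: str, size: int) -> None:
    """Keep only the pointer and some metadata in the database row."""
    db.execute(
        "INSERT INTO assets (name, uri, size_bytes) VALUES (?, ?, ?)",
        (name, uri, size),
    )


store = Path(tempfile.mkdtemp())  # stand-in for a bucket
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE assets (name TEXT PRIMARY KEY, uri TEXT, size_bytes INTEGER)")

payload = b"\x00" * 1024  # pretend this is a large binary
uri = put_blob(store, payload)
register_asset(db, "texture.bin", uri, len(payload))
row = db.execute(
    "SELECT uri, size_bytes FROM assets WHERE name = ?", ("texture.bin",)
).fetchone()
```

The row stays small no matter how big the payload gets; the cost is that nothing in the database guarantees the object behind the URI still exists.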

In the age of Large Language Models, large blobs will become the rule, not the exception. You’re not going to retrain models costing $100M to build from scratch because of the limitations of your SCM.
I don’t store those in my scm. It’s not a limitation of my scm that I can’t store a 20gig model directly in the repo.

So you’re right, I’m not going to retrain models costing $100M because of SCM limitations. That’s because I don’t have any SCM limitations.

I fail to understand people who can't be bothered to empathize with use cases other than their own. Game development usually has a large number of binary assets that need to be in source control. Does that sound like a reasonable use, or are they also doing it wrong?
GF is working for a startup doing a game. They were using git and dumped it because it just cannot deal. Also the content people found it annoying without providing any value whatsoever.
> if I have a giant binary file in source control “I’m doing it wrong” so git helps me design better

Your VCS should not be opinionated, that is not its job

Source control is all about managing diffs. Large files are fine, binary doesn’t make sense. Most of the time binary file diffs aren’t human readable.

I store binary files outside of git but keep build logs containing binary file CRCs on git
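
One way such a CRC log could be produced, as a rough sketch. The function names and the line layout (hex CRC, size, file name) are assumptions for illustration, not the commenter's actual setup:

```python
import zlib
from pathlib import Path


def crc32_of_file(path: Path, chunk_size: int = 1 << 20) -> int:
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # feed the running CRC back in
    return crc


def manifest_lines(paths: list[Path]) -> list[str]:
    """One text line per file: 8-digit hex CRC, size in bytes, file name."""
    return [
        f"{crc32_of_file(p):08x} {p.stat().st_size} {p.name}"
        for p in sorted(paths)
    ]
```

The resulting lines are plain text, so they diff cleanly in git. A CRC only detects accidental change; a cryptographic hash would be the safer choice if tampering matters.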

> Source control is all about managing diffs. Large files are fine, binary doesn’t make sense

In git, diffs are literally just a UI thing.

That's not really true, is it? Surely Git does have an internal concept of diffing changes, specifically so it knows whether two commits can be merged automatically or if they conflict (because they changed the same lines in the same file).
> That's not really true, is it?

It is.

> Surely Git does have an internal concept of diffing changes

Not in the data model. Packing has deltas, but they're not textual diffs, and they would work fine with binary data... to the extent that the binary data doesn't change too much and the delta-ification algorithms are tuned for that (both of which are doubtful).

> specifically so it knows whether two commits can be merged automatically or if they conflict (because they changed the same lines in the same file).

Conflict generation & resolution is performed on the fly.
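
To illustrate the "not in the data model" point: a blob's ID is just a hash of a short header plus the full file contents. A small sketch, assuming the default SHA-1 object format (the sample byte strings are made up):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob under the SHA-1 object format."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


v1 = b"\x89PNG fake image bytes, version 1"
v2 = b"\x89PNG fake image bytes, version 2"

# Each version is its own whole object; neither ID records a relationship
# to the other, and text and binary content are treated identically.
print(git_blob_id(v1))
print(git_blob_id(v2))
```

Any diff between the two versions has to be computed afterwards from the two full contents.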

Most binary files that people want to store in a VCS are stuff like .psd, .xlsx, .docx, and the like - data that's created by people by hand, but not stored as text.
Xlsx and docx are just zipped up xml text. You can store it as text if you like and I think there are many git modules to handle this. But the xml isn’t really that diffable so I don’t bother.
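
A rough sketch of the "store it as text" idea: unzip the container and compare the XML members. The `make_fake_docx` helper only mimics a .docx layout for the demo, and decoding as UTF-8 is an assumption that real files may not always satisfy:

```python
import io
import zipfile


def xml_parts(archive_bytes: bytes) -> dict[str, str]:
    """Return the decoded XML members of a zip-based Office file, keyed by name."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith(".xml"):
                parts[name] = zf.read(name).decode("utf-8")
    return parts


def make_fake_docx(body: str) -> bytes:
    """Build a tiny zip that only imitates a .docx (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", f"<doc><p>{body}</p></doc>")
    return buf.getvalue()


old = xml_parts(make_fake_docx("first draft"))
new = xml_parts(make_fake_docx("second draft"))
changed = [name for name in old if old[name] != new.get(name)]
```

This tells you which parts changed, but as noted above, the XML inside real documents tends to be noisy enough that line-level diffs are hard to read.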
if you have large files in your repository, you have a design problem.
Not in gamedev where you can have hundreds of gigs of art assets (models, textures, audio...), but you still want to version them or even have people working on them at the same time (maps...). But that is a different can of worms entirely.
Indeed I have 3D assets in this case. Would this be done differently in an enterprise that has all kinds of tools to manage specialty workflows? Sure. Do I want to spend my days configuring and maintaining some binary blob / LFS storage system? No.

I’ve migrated a lot of projects from fossil to git eventually, but I dare say they never would have made it that far, had I started out with more friction, including fighting vcs tools.

git is for when you want to track changes to part of a file.

In your scenario, you just want to track different versions of a file.

You can equally say that git is for when you want to track changes. And then it's a failing of git.

Besides, what's the difference? It's a file. The contents changed. Git doesn't say anything at all along the lines of "30% or more different means it's not a good fit for git".

You can certainly use git, but then the model used to apply patches from different branches doesn't work for binary blobs.

So all the things that git specializes in are actually things you don't want.

That seems like an implementation detail that could change tomorrow, at which point it could be perfectly fine to store large blobs in your repository, yea?

I completely agree Git is bad at this now, to be clear. I've watched single-file repositories bloat to hundreds of gigabytes due to lots of commits to a single 1MB file. But that doesn't seem like a design problem, just implementation.
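
For a sense of scale, a back-of-the-envelope sketch. It assumes the worst case where every commit stores a full copy of the file and neither compression nor delta packing saves anything, which is an upper bound rather than a claim about what git actually did in those repositories:

```python
def worst_case_repo_bytes(file_size: int, commits: int) -> int:
    """Upper bound: every commit keeps a full, undeltified copy of the file."""
    return file_size * commits


MB = 1_000_000
GB = 1_000_000_000

# How many commits of a 1 MB file reach 100 GB under that assumption?
commits_for_100_gb = (100 * GB) // (1 * MB)
```

Under that assumption it takes on the order of a hundred thousand commits, so the growth is linear in history length rather than anything exotic.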

Then use a binary repository like Artifactory which has LFS support. You can still version them in git- just don’t store them in there
> Then use a binary repository [...] You can still version them in git- just don’t store them in there

So git works, as long as you include things that aren't git to handle what it can't.

Stockholm syndrome really blinds people

Not sure if you were agreeing with me or not, but I run into this often: people do not use the right tools and try to make one tool fit every CM scenario. SharePoint sucks but it has its place, as do Artifactory and Nexus.
That's what object storage with versioning turned on is for e.g. GCS or S3
Although blob storage works well for versioning, you have to make heavy use of the underlying proprietary API to get these versions, and I am not quite sure you can do more complex operations, like diff and bisect between those versions, the way you could with git.
Why use git at all then? Just use an object store with versioning turned on.
Because git excels at relatively small text files, and at patching and diffing them. You can't easily diff binary blobs like JPEGs, audio, or video.
But that's my point: why can't a version control system be good for this as well? It's the same thing underneath. Why do we have to split these different use cases across different tools and hope a foreign key constraint holds?
That's a ridiculous claim. Can you really not think of a single situation in which it makes sense to keep track of big pieces of data alongside (or even instead of) source code? The fact that many VCS don't handle large binary data nicely doesn't mean there's never a good reason to do so.
> Can you really not think of a single situation

It doesn't even matter if they can think of one; assuming your own use cases for software are everyone's is proceeding from false premises and is the sort of thing that leads to (and necessitates) "hacky workarounds" and eventually the adoption of better software we should've had in the first place.

Assume nothing about users' use cases. A VCS should not be imposing arbitrary limitations on the files it's indexing. It's like the old-school filesystems we (surprise, surprise) deprecated.

> if you have large files in your repository, you have a design problem

Your workflow and use cases aren't everyone's.

A workflow design problem?