by prepend 1001 days ago
It is extremely rare that I have a file over 100MB.

I also think it’s one of those situations where if I have a giant binary file in source control “I’m doing it wrong” so git helps me design better.

It’s like in the olden days when you couldn’t put blobs directly in a row so databases made you do your file management yourself instead of just plopping in files.

I like git. I don’t like giant binary files in my commit history. It’s cool that you like fossil, but I don’t see this as a reason for me to use it.


You didn't put blobs directly in the database because of annoying database limitations, not because there's a fundamental reason not to.

It's the same with Git. Don't put large files directly in Git because Git doesn't support that very well, not because it's fundamentally the wrong thing to do.

There should be a name for this common type of confusion: Don't mistake universal workarounds for desirable behaviour.

The fundamental reason had to do with how an RDBMS structured its pages of data: having arbitrarily sized blobs directly in the record broke the storage optimization and made performance tank.

It was a design constraint back in the day.

I haven’t looked at this in decades, but I think now it’s all just pointers to the file system and not actually bytes in the record.

So it was fundamentally the wrong thing to do based on how databases stored data for performant recall.

But that’s back when disks were expensive and distributed nodes were kind of hard.

> I think now it’s all just pointers to the file system

It depends. InnoDB, assuming the DYNAMIC row format, will store TEXT/BLOB on-page up to 40 bytes, at which point it gets sent off-page with a 20-byte pointer on-page. However, it comes with a potentially severe trade-off before MySQL 8.0.13: any queries with those columns that would generate a temporary table (CTEs, GROUP BY with a different ORDER BY predicate, most UNIONs, many more) can’t use in-memory temp tables and instead go to disk. Even after 8.0.13, if the size of the temp table exceeds a setting (default of 16 MiB), it spills to disk.

tl;dr - be very careful with MySQL if storing TEXT or BLOB, and don’t involve those columns in queries unless necessary.

Postgres, in comparison, uses BYTEA as a normal column that gets TOASTed (sent off-page in chunks) after a certain point (I think 2 KiB?), so while you might need to tune the column storage strategy for compression - depending on what you’re storing - it might be fine. There are various size limits (1 GiB?) and row count limits for TOAST, though. The other option is the Large Object facility, which requires its own syntax for storage and retrieval, but avoids most of the limitations mentioned.

Or, you know, chuck binary objects into object storage and store a pointer or URI in the DB.
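For anyone who wants to see the "pointer in the DB" pattern concretely, here is a minimal sketch. It assumes a local directory standing in for object storage and SQLite standing in for the database; the `assets` table and the `init_db`, `store_blob` and `load_blob` helpers are hypothetical names made up for illustration.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: the row holds metadata and a pointer, never the bytes.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "name TEXT PRIMARY KEY, sha256 TEXT NOT NULL, location TEXT NOT NULL)"
    )


def store_blob(conn: sqlite3.Connection, blob_dir, name: str, data: bytes) -> str:
    """Write the bytes to blob_dir (keyed by SHA-256) and record the pointer."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(blob_dir) / digest
    if not path.exists():  # identical content is stored only once
        path.write_bytes(data)
    conn.execute(
        "INSERT OR REPLACE INTO assets (name, sha256, location) VALUES (?, ?, ?)",
        (name, digest, str(path)),
    )
    return digest


def load_blob(conn: sqlite3.Connection, name: str) -> bytes:
    """Follow the pointer stored in the DB and verify the checksum."""
    row = conn.execute(
        "SELECT sha256, location FROM assets WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    expected, location = row
    data = Path(location).read_bytes()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {name}")
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as blob_dir:
        conn = sqlite3.connect(":memory:")
        init_db(conn)
        store_blob(conn, blob_dir, "model.bin", b"\x00\x01binary payload")
        assert load_blob(conn, "model.bin") == b"\x00\x01binary payload"
        conn.close()
```

With a real object store, `location` would be a bucket key or URI and the read/write calls would go through that store's client, but the row would look much the same.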

In the age of Large Language Models, large blobs will become the rule, not the exception. You’re not going to retrain models costing $100M to build from scratch because of the limitations of your SCM.

I don’t store those in my scm. It’s not a limitation of my scm that I can’t store a 20gig model directly in the repo.

So you’re right, I’m not going to retrain models costing $100M because of SCM limitations. That’s because I don’t have any SCM limitations.

I fail to understand people who can't be bothered to empathize with use cases other than their own. Game development usually has a large number of binary assets that need to be in source control. Does that sound like a reasonable use, or are they also doing it wrong?

My GF is working for a startup making a game. They were using git and dumped it because it just couldn't cope. The content people also found it annoying, without it providing any value whatsoever.

> if I have a giant binary file in source control “I’m doing it wrong” so git helps me design better

Your VCS should not be opinionated; that is not its job.

Source control is all about managing diffs. Large files are fine, binary doesn’t make sense. Most of the time binary file diffs aren’t human readable.

I store binary files outside of git but keep build logs containing the binary files' CRCs in git.
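A rough sketch of what such a CRC log could look like, assuming one `crc32  path` line per file; the manifest format and the `crc32_of_file` and `build_manifest` helpers are invented here for illustration, not the commenter's actual setup.

```python
import zlib
from pathlib import Path


def crc32_of_file(path, chunk_size: int = 1 << 20) -> int:
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # running checksum over chunks
    return crc


def build_manifest(root) -> str:
    """Return one 'crc32hex  relative/path' line per file under root, sorted."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return "".join(
        f"{crc32_of_file(p):08x}  {p.relative_to(root).as_posix()}\n"
        for p in files
    )
```

The resulting text is small and diffs cleanly, so committing it records which binaries went into a build without putting the binaries themselves in the repo. CRC-32 only catches accidental changes; a cryptographic hash would be the safer choice if tampering matters.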

> Source control is all about managing diffs. Large files are fine, binary doesn’t make sense

In git, diffs are literally just a UI thing.

That's not really true, is it? Surely Git does have an internal concept of diffing changes, specifically so it knows whether two commits can be merged automatically or if they conflict (because they changed the same lines in the same file).

> That's not really true, is it?

It is.

> Surely Git does have an internal concept of diffing changes

Not in the data model. Packing has deltas, but they're not textual diffs, and they would work fine with binary data... to the extent that the binary data doesn't change too much and the delta-ification algorithms are tuned for that (both of which are doubtful).

> specifically so it knows whether two commits can be merged automatically or if they conflict (because they changed the same lines in the same file).

Conflict generation & resolution is performed on the fly.
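To illustrate the data-model point: a blob's ID in a default (SHA-1) repository is just a hash over a short header plus the file's full contents, so nothing diff-like is involved and text and binary bytes are treated alike. A minimal sketch, where `git_blob_id` is a made-up helper name:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob with this content (SHA-1 object format)."""
    # Header is the object type, the content length in decimal, and a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID.
assert git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
# Same value as `echo 'hello world' | git hash-object --stdin`.
assert git_blob_id(b"hello world\n") == "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
```

Changing a single byte yields a new, unrelated ID and a new whole-file object; any delta compression happens later, in packfiles, as a storage detail.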

Most binary files that people want to store in a VCS are stuff like .psd, .xlsx, .docx, and the like - data that's created by people by hand, but not stored as text.

Xlsx and docx are just zipped-up XML text. You can store them as text if you like, and I think there are many git tools to handle this. But the XML isn’t really that diffable, so I don’t bother.
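The "zipped XML" part is easy to check with the standard library. A small sketch, assuming you only want to pull out the XML parts as text; `make_fake_docx` builds a stand-in archive for the demo (not a valid Word document), and both helper names are hypothetical.

```python
import io
import zipfile


def xml_parts(archive_bytes: bytes) -> dict:
    """Map each .xml member of a zip-based Office file to its decoded text."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name).decode("utf-8")
            for name in zf.namelist()
            if name.endswith(".xml")
        }


def make_fake_docx() -> bytes:
    """Build a tiny zip that mimics the layout of a .docx, for illustration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document><p>Hello</p></document>")
        zf.writestr("word/media/image1.png", b"\x89PNG not really an image")
    return buf.getvalue()


parts = xml_parts(make_fake_docx())
assert list(parts) == ["word/document.xml"]
```

On a real file you would pass `open("report.docx", "rb").read()` instead. The extracted XML tends to be long single lines of markup, which fits the complaint that it isn't very diffable.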