by ch71r22 1285 days ago
What are the reasons against?

The lack of reasons for doing it IS the reason against. Git isn't a magic 'good way' to store arbitrary data. It's a good way to collaborate on projects written in the many programming languages that store code as plain text broken into short lines, where edits to non-adjacent lines can generally be merged concurrently without careful human verification. That is an extremely specific use case, and outside of it git is terrible and inefficient, and gives almost no benefit despite huge problems.
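
To make the "edits to different lines merge cleanly" point concrete, here is a minimal sketch of a line-wise three-way merge. It is not git's actual merge algorithm. It assumes, for simplicity, that both sides only change existing lines (no insertions or deletions), and `merge_lines` is a hypothetical helper name.

```python
def merge_lines(base, ours, theirs):
    """Line-wise three-way merge.

    Assumption: all three versions have the same number of units
    (lines, or one opaque blob), i.e. edits are in-place only.
    Returns (merged, conflicts) where conflicts lists the indexes
    that both sides changed in different ways.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place edits")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (or neither changed it)
            merged.append(o)
        elif o == b:      # only "theirs" changed this unit
            merged.append(t)
        elif t == b:      # only "ours" changed this unit
            merged.append(o)
        else:             # both changed it differently: needs a human
            merged.append(None)
            conflicts.append(i)
    return merged, conflicts


# Source code: many short lines, so two edits rarely touch the same unit.
base = ["a = 1", "b = 2", "c = 3"]
ours = ["a = 10", "b = 2", "c = 3"]
theirs = ["a = 1", "b = 2", "c = 30"]
text_merged, text_conflicts = merge_lines(base, ours, theirs)

# Binary blob: the whole file is one opaque unit, so any two edits collide.
blob_merged, blob_conflicts = merge_lines(
    [b"\x00\x01\x02"], [b"\x09\x01\x02"], [b"\x00\x01\x07"]
)
```

The text case merges with no conflicts, while the blob case conflicts even though the two edits touched different bytes, because nothing below the whole-file level is comparable.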

People in ML ops use git because they aren't very experienced as professional programmers, they have git available to them, and they haven't yet run into the consequences of using it to store large binary blobs: it eventually becomes impossible to live with and wastes a huge amount of time and space.

ML didn't invent the need for large artifacts that can't be versioned in source control but must be versioned with it, but they don't know that because they are new to professional programming and aren't familiar with how it's done.

I literally don't know anyone or any team in ML using git as a data versioning tool. It doesn't even make sense to me, and most mlops people I have talked to would agree. Is that really the point of this tool? To be a general purpose data store for mlops? I thought it was for very specialized ML use cases, because even 1TB isn't much for ML data versioning.

Mlops people are very aware of tools that are more suited for the job... even too aware, in fact. The entire field is full of tools, databases, etc. to the point where it's hard to make sense of it. So your comment is a bit weird to me.

I build mlops solutions at a big tech company. Agreed, most mature ML teams are not using git for ML data versioning, but in my experience and user research it's not due to lack of intent. Teams have been forced to move to other ML data tools in the absence of a scalable git solution, and most of those tools come with a lot of cognitive overhead for ML engineers who don't want to spend time adopting several custom tools in their ML pipelines.
I think you'll find varying levels of maturity in ML ops. Anyway, I think we basically agree: if you use something like this you aren't that mature, and if you are mature you would avoid this thing.
Indeed, there is a lot of pain if you actually try to store large binary data in git. But we managed to make that work! So a question worth asking is how might things change IF you can store large binary data in git??
That is exactly what git-lfs is: a way to "version control" binary files by storing their revisions, possibly separately, while the actual repo contains text files plus "pointer" files that each reference a binary file.
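
For anyone who hasn't looked inside one, a pointer file is just a few lines of text. Here is a rough sketch of building one in the Git LFS v1 pointer layout (version line, SHA-256 object id, size in bytes). `make_lfs_pointer` is a hypothetical helper for illustration, not part of git-lfs, and the sample content is made up.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Return the text of a Git LFS v1 style pointer for `content`.

    The repo stores this small text file; the real bytes live in
    separate storage, keyed by the SHA-256 object id.
    """
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Made-up stand-in for a large binary asset.
asset = b"pretend this is a 2 GB texture"
pointer = make_lfs_pointer(asset)
print(pointer)
```

Because the object id is a hash of the whole file, two revisions that differ by one byte get two different ids and are stored as two full objects.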

It's not perfect, and still feels like a bit of a hack compared to something like p4 for the context I use LFS in (game dev), but it works, and doesn't require expensive custom licenses when teams grow beyond an arbitrary number like 3 or 5.

XetHub Co-founder here. Yes, we use the same Git extension mechanism as Git LFS (clean/smudge filters) and we store pointer files in the git repository. Unlike Git LFS we do block-level deduplication (Git LFS does file-level deduplication) and this can result in a significant savings in storage and bandwidth.
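
To illustrate the difference between file-level and block-level deduplication, here is a toy sketch. It is not XetHub's implementation: the fixed 4-byte block size and the function names are assumptions chosen to keep the example small, and how XetHub actually splits files is not described here.

```python
import hashlib

BLOCK_SIZE = 4  # assumption: tiny fixed-size blocks, for demonstration only


def file_level_bytes(versions):
    """Bytes stored when each distinct whole file is kept once."""
    store = {hashlib.sha256(v).hexdigest(): v for v in versions}
    return sum(len(v) for v in store.values())


def block_level_bytes(versions, block_size=BLOCK_SIZE):
    """Bytes stored when each distinct block is kept once."""
    store = {}
    for data in versions:
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            store[hashlib.sha256(block).hexdigest()] = block
    return sum(len(b) for b in store.values())


# Two revisions of one file; only the third block differs.
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"

file_total = file_level_bytes([v1, v2])    # both revisions stored in full
block_total = block_level_bytes([v1, v2])  # shared blocks stored once
```

With these inputs the file-level store holds 32 bytes and the block-level store holds 20. Note that fixed-size blocks stop lining up as soon as bytes are inserted or removed near the start of a file, so this sketch only shows the in-place edit case.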

As an example, a Unity game repo reduced in size by 41% using our block-level deduplication vs Git LFS. Raw repo was 48.9GB, Git LFS was 48.2GB, and with XetHub was 28.7GB.
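
As a quick sanity check on those three sizes (the only inputs are the figures quoted above; `percent_saved` is just a throwaway helper):

```python
raw_gb, lfs_gb, xet_gb = 48.9, 48.2, 28.7  # sizes quoted above, in GB


def percent_saved(before, after):
    """Percentage reduction going from `before` to `after`."""
    return round((before - after) / before * 100, 1)


lfs_vs_raw = percent_saved(raw_gb, lfs_gb)  # 1.4
xet_vs_raw = percent_saved(raw_gb, xet_gb)  # 41.3
xet_vs_lfs = percent_saved(lfs_gb, xet_gb)  # 40.5
```

So the quoted 41% matches the reduction relative to the raw repo, and the reduction relative to the Git LFS size works out to roughly 40.5%.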

Why do you think using a Git-based solution is a hack compared to p4? What part of the p4 workflow feels more natural to you?

For one thing, the centralised model of Perforce is a more natural fit, since by default it allows you to clone subsets, and just the latest version of files. File locking is much more integrated into the p4 workflow as well. In git you can still modify locked files locally and then commit them: the check only happens on push, and sometimes git fails to send the lock notification upstream. Oh, and it breaks down entirely if you use branches.

Some of these have workarounds and hacks for more experienced users. I'm not about to run around teaching people the intricacies of arcane git incantations, while p4 functions, by default, how you'd want to. The programming side is better on git though, yeah.

(XetHub engineer here)

We're working on Perforce-style locking on XetHub, and I believe git already supports things like only cloning the latest version of files. Cloning the full repo without "smudging" (pulling in binary file contents) is already possible, and cloning while smudging a subset is on our roadmap. We're definitely on a path to making the git UX for dealing with large binary files as easy as Perforce, and there are lots of advantages to keeping a git-based workflow for teams that already work with git.
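
For reference, here is roughly what "latest version only, no smudging" looks like with stock git plus Git LFS. This is a sketch of the Git LFS equivalent, not XetHub's own tooling: `--depth 1` is git's shallow clone flag and `GIT_LFS_SKIP_SMUDGE` is the Git LFS environment variable, while the helper name and the example URL are made up.

```python
import os


def lfs_shallow_clone(url, dest, skip_smudge=True):
    """Build the command and environment for a shallow, pointer-only clone.

    Returns (argv, env) rather than running git, so the caller decides
    when to execute it.
    """
    # --depth 1 fetches only the most recent commit's history.
    argv = ["git", "clone", "--depth", "1", url, dest]
    env = dict(os.environ)
    if skip_smudge:
        # Tells Git LFS to leave pointer files in place instead of
        # downloading the binary contents during checkout.
        env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return argv, env


# Hypothetical repository URL and destination directory.
argv, env = lfs_shallow_clone("https://example.com/game.git", "game")
# To actually run it: subprocess.run(argv, env=env, check=True)
```

With Git LFS, a subset of the binaries can then be fetched afterwards with `git lfs pull --include` and a path pattern.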

I think this is a foot-gun, it's a bad idea even if it works great, and I doubt it works very well. You should manage your build artifacts explicitly, not just jam them in git along with the code that generates them because you are already using it and you haven't thought it through.
I don't think you've made your case here. The practices you describe are partly an artifact of computation, bandwidth, and storage costs. But not the current costs: the ones from when git was invented more than 15 years ago. In the short term, we have to conform to the computer's needs. But in the long term, it has to be the other way around.
You're right! It makes way more sense, in the long run, to abuse a tool like git in a way that it isn't designed for and which it can't actually support, and then, instead of actually using git, use a proprietary service that may or may not be around in a week. Here I was thinking short term.
You seem nice.
Xet's initial focus appears to be on data files used to drive machine learning pipelines, not on any resulting binaries.