by taylorlafrinere 3318 days ago
Most of that 300GB isn't text. There are test assets, images, videos, built binaries, VHDs, etc. Also, I should be clear that the 300GB is just at tip (no history). We can debate whether those things should be checked into the repo, but they are there now.

How did you go about creating the central repo, and how long did it take? Migrating an SVN repo that is 2GB at tip with 100k commits is taking me many days, and each odd failure typically has me restart the process after filtering out some obscure part of the tree.

Edit: I read in another comment that you dropped the history. Understandable, but I can appreciate how that would add friction (devs having to look through two different histories).

The Windows team developed a tool called "GitTrain" that knew how to:

- migrate the tip of a branch to Git (yes, the 300GB number is the tips of all the interesting branches, not the history)

- keep a Git branch and an SD branch in sync for a while

- be re-run over each of the 400+ branches they care about

But they went through some of the same trial-and-error process that you're describing.
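
For readers who want a concrete picture of a tip-only migration, here is a rough sketch. It is not GitTrain, whose internals aren't described in this thread. It assumes each branch tip has already been exported from the old system into a snapshot directory; the names `plan_tip_import` and `plan_all` and the example paths are hypothetical. It only builds the list of git commands and does not run them, and it leaves out the sync step entirely.

```python
from typing import Dict, List


def plan_tip_import(branch: str, snapshot_dir: str) -> List[List[str]]:
    """Build git commands that import one branch tip as a single commit with no history.

    Assumption: snapshot_dir already holds the files exported from the old system.
    """
    return [
        # Start a branch whose first commit will have no parent commits.
        ["git", "checkout", "--orphan", branch],
        # Stage the snapshot directory's contents, including deletions
        # relative to whatever is currently in the index.
        ["git", f"--work-tree={snapshot_dir}", "add", "-A"],
        # Record the tip as one commit.
        ["git", "commit", "-m", f"Import tip of {branch} (no history)"],
    ]


def plan_all(snapshots: Dict[str, str]) -> Dict[str, List[List[str]]]:
    """Repeat the same plan for every branch of interest (branch name -> snapshot dir)."""
    return {branch: plan_tip_import(branch, path) for branch, path in snapshots.items()}


if __name__ == "__main__":
    # Hypothetical branch names and paths, for illustration only.
    for branch, commands in plan_all({"main": "/snapshots/main"}).items():
        for command in commands:
            print(branch, " ".join(command))
```

Building the plan separately from executing it makes it easier to re-run a single failed branch, which matters when the same process is repeated across hundreds of branches.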

Whoa. 300GB with a shallow clone?! What size does the whole repo use on the server side?

The pack file size for a full clone is 187GB. The 300GB is the working directory. We did not import the history of the code base, so the current repo only has about 5 months of history. As others have called out, there are a lot of assets in the repo that don't compress.

Why only 5 months? Will more of the history be added to the Git repository eventually?

No, we'll keep the SD servers around for a while for servicing older products. We also have a "breadcrumbing" system that lets an engineer follow a file's history back from Git to the old system.

Was importing the complete history tried during development? This is very interesting. The Git history will grow at breakneck speed and will reach a similar size soon enough. Is this to delay the inevitable tech wrangling for dealing with terabyte histories, or were there issues with the import/sync?

Or maybe it was just the initial repo setup used for alpha testing that got promoted to production :)