by BeefySwain 1087 days ago
Sidestepping all of the ethical questions of embarking on this "research", I'm surprised the number was that low.

Linux[0] itself has about 1.2 million commits, so apparently Linux is within an order of magnitude of bringing GitHub to its knees?

[0] https://github.com/torvalds/linux


Microsoft’s Azure docs repo has 1.1M commits, and it’s many gigabytes in size. I made the mistake of trying to clone it to fix an issue I ran into in the docs. Ended up just editing it on GitHub because fuck that.

https://github.com/MicrosoftDocs/azure-docs

You can clone just the latest few commits:

    git clone --depth [depth] [remote-url]
I don't think that works:

    > git clone --depth 1 https://github.com/MicrosoftDocs/azure-docs
    Cloning into 'azure-docs'...
    remote: Enumerating objects: 107158, done.
    remote: Counting objects: 100% (107158/107158), done.
    remote: Compressing objects: 100% (101843/101843), done.
    Receiving objects:  17% (18217/107158), 780.25 MiB | 43.72 MiB/s
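
For anyone comparing clone options, here is a minimal Python sketch that only assembles the `git clone` argument list and does not run anything. `--depth`, `--filter=blob:none` (partial clone) and `--sparse` are real Git flags, but the helper name `clone_command` and its parameters are made up for illustration, and whether the extra flags actually shrink the download for this particular repo is an assumption I have not measured.

```python
import shlex
from typing import List, Optional


def clone_command(
    remote_url: str,
    depth: Optional[int] = 1,
    blobless: bool = False,
    sparse: bool = False,
) -> List[str]:
    """Build an argv list for `git clone`; nothing is executed here.

    Hypothetical helper for illustration only.
    """
    argv = ["git", "clone"]
    if depth is not None:
        # Shallow clone: truncate history to the given number of commits.
        argv += ["--depth", str(depth)]
    if blobless:
        # Partial clone: defer downloading file contents until needed.
        argv.append("--filter=blob:none")
    if sparse:
        # Start with a sparse checkout (top-level files only).
        argv.append("--sparse")
    argv.append(remote_url)
    return argv


if __name__ == "__main__":
    url = "https://github.com/MicrosoftDocs/azure-docs"
    # The plain shallow clone from the comment above.
    print(shlex.join(clone_command(url)))
    # A narrower variant; savings depend on the repo and Git version.
    print(shlex.join(clone_command(url, blobless=True, sparse=True)))
```

Even with `--depth 1`, a normal checkout still needs every file in the latest commit, which may be why the transfer above is so large. The sparse variant is one thing to try, not a guaranteed fix.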
I think it’s a rate issue, not the number of commits.
IIRC, some years ago the Homebrew repo caused too much load because of its architecture, where every client would pull on install or update. Or something like that.

Part of the GitHub response afaik included the info that they went as far as they could with dedicated and beefier servers but asked for a software fix.

I would think that if GitHub anticipates a normal repo growing this large, they can give it special treatment.

There's a rough rule of thumb that you should expect to redesign your system to handle each order of magnitude increase in scale, and I figure it applies here too—gracefully handling that size of repo would require substantial engineering work, and they have plenty of time to handle it before human-oriented open source repos get even close to the current limit.
I'm not sure redesigns were necessary between going from 1 to 10, from 10 to 100, from 100 to 1000, from 1000 to 10'000, from 10'000 to 100'000, or from 100'000 to 1'000'000, which is where we are now. It sounds like a sensible engineering rule, but I'm not sure it translates to software, or at least not in this case. I don't know of any design changes made to Git since it was first created, there's no v1 and v2 repositories for example.
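
To make the count explicit, here is a small Python sketch of the arithmetic: how many tenfold steps separate a single commit from roughly a million. The function name `magnitude_steps` is made up for illustration, and using integer multiplication rather than a logarithm is just a choice to avoid floating-point rounding.

```python
def magnitude_steps(start: int, end: int) -> int:
    """Count how many full tenfold increases fit between start and end."""
    if start <= 0 or end < start:
        raise ValueError("expected 0 < start <= end")
    steps = 0
    size = start
    # Keep multiplying by ten while the next milestone is still reachable.
    while size * 10 <= end:
        size *= 10
        steps += 1
    return steps


if __name__ == "__main__":
    # From 1 commit to 1'000'000 commits, as listed in the comment above.
    print(magnitude_steps(1, 1_000_000))
    # Roughly the Linux commit count mentioned earlier in the thread.
    print(magnitude_steps(1, 1_200_000))
```

Under the rule of thumb being discussed, each of those steps would be a point where a redesign is expected.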
It depends on how quickly you pass through each order of magnitude milestone. I remember reading about how MySpace grew something like five orders of magnitude in less than a year, and no matter how scalable your architecture is you're going to hit a point during that where you need to rearchitect your whole system.

Slower growth allows for forward planning and incremental architectural changes.

> there's no v1 and v2 repositories for example

We wouldn’t know. GitHub is probably running something very different to normal local git including optimizations for performance and cost.

They must only ensure API/protocol compatibility and could have already replaced everything else many times over.

> There's a rough rule of thumb that you should expect to redesign your system to handle each order of magnitude increase in scale

The rule I know is a little different: with good engineering, you can modify a system to handle one order of magnitude more than it was designed for. As soon as an increase of two orders of magnitude is possible, you had better redesign the system.