Hacker News
by WestCoader 55 days ago
Nothing is pissing me off more than GitHub's stability going down the tubes RIGHT as work is migrating everything, and I mean everything, from CircleCI to GH.

The wildest thing is that Azure Repos/Pipelines was better than this.

The one caveat is that they are still migrating it to Azure infra, so it's possible it's still in a one-foot-in, one-foot-out kinda scenario, from what I've heard. But this isn't inspiring confidence.

15 comments

They're claiming a huge increase in traffic due to vibe coded projects. It might just be an excuse, but it certainly seems plausible to me.
Could be. But 99% of the repos are static garbage with no PR nor actions.

They mentioned they have some Elasticsearch reindexing going on. I would guess they needed to reshard or move stuff and something didn't work well. But if I understood it right, they mentioned the PRs ES index, and they didn't share proof that it grew along with the number of repos.

It might be anything. It seems they lost huge chunks due to layoffs, structural changes, and MS, which has the reverse Midas touch.

This is just pure speculation, but there is also no longer any reason for MS to keep GH working. They absorbed all the code they wanted. Now they can let it burn. It would be even better for them if that happened.

> Could be. But 99% of the repos are static garbage with no PR nor actions.

But the 1% of repos that do have PRs and actions are likely going to be seeing enormous increases in volumes

I have been a part of two very large companies with self hosted gits and I've seen enough to be confident that this is an incredibly hard thing to manage

Ya but they are owned by freaking microsoft and have billions of dollars and employees to throw at the problem. The outage problems shouldn't be happening period.
Easy to say that! Some problems are legitimately hard to solve though. Github is likely seeing usage patterns that have never been seen before and I bet some of these failure modes are novel

If you are at the limits of your architecture you may need to rewrite things, and if you are rewriting things you cannot arbitrarily speed that up by throwing dollars at it.

That's entirely a predicament of Microsoft's own making, though. Don't forget that they're the ones who launched "AI" programming into the hype cycle to begin with. So it's entirely reasonable to hold them as a company responsible for the resulting outages, which indeed shouldn't be happening. Dogs, fleas, and so on.
It's not like MS is involved with AI and says they can make anything in minutes with AI, too.
Serious question, have you been part of an org that had to scale orders of magnitude very quickly?

Anyone who has been part of that journey knows how painful it really is. A lot of times the systems fail at all levels, and you have to redesign it from the first principles.

> Serious question, have you been part of an org that had to scale orders of magnitude very quickly?

I have, but it depends what you mean.

Scenario 1: e-commerce SaaS (think: Amazon but whitelabel, and before CPUs even had AES instructions); Christmas was "fun".

Scenario 2: Video Games. The first day is the worst day when it comes to scale. Everything has to be flawless from day 0 and you get no warning as to what can go wrong.

Yet, somehow, I managed to make highly reliable systems.

In scenario 1; I had an existing system that had to scale up and down with load, this was before there was cloud and hardware had a 3-4 month lead time, so most of the effort was around optimising existing code, increasing job timeouts and "quenching" sources that were expensive. We also used to do some 'magic' when it came to serving requests that had a session token or shopping cart cookie.

In scenario 2; we have a clean-room implementation and no legacy, which is a blessing but also a curse, there's no possibility to sample real usage: but you also don't need to worry about making breaking changes that are for the better. With legacy you have to figure out how to migrate to the new behaviour gradually.

So, pros and cons... but it's not like handling huge load hasn't been done before, computers are faster than they ever have been and while my personal opinion is that operational knowledge is dying (due to general disdain for people who actually used to run systems that scale: not just write hopeful "eventually consistent" yaml that they call deterministic) - the systems that do exist today hold your hand much better than they did for me 20 years ago.

And I ran 1% of web traffic with an ops team of 5 back then. So, idk what's going on here.

EDIT: Likely people are flagging me because I sound arrogant (or I hurt their feelings by talking bad about YAML-ops), but all I am doing is answering the question presented based on my experience.

It really, really depends on what you mean. Specifically, it depends on the application and its various compute, I/O and access patterns. Scaling ecommerce and games is well-known by now (e.g. Amazon and Blizzard have been dealing with insane scale for two decades now.) However, anything outside a well-known pattern can be very tricky to scale.

I once worked on a team that had to 100x scale a system whose downstream dependencies were various 3rd party APIs and data sources, most of which had no real SLAs to speak of and had extremely high variance in latencies and data transfer patterns. This basically required rearchitecting everything including our clients because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call-tree and caused outages 3+ services removed from ours. In some cases, the re-architecting went all the way to the UI.
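For readers who haven't seen it, one common way to stop a slow or failing external API from rippling up the call tree is a circuit breaker. This is only a minimal sketch, not what that team built: the `CircuitBreaker` class, its thresholds, and the `fallback` idea are all hypothetical names chosen for illustration.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch: stop calling a flaky dependency after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, fallback):
        # While the circuit is open, skip the dependency entirely so its
        # latency or errors cannot propagate to our own callers.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

The fallback might return cached or partial data. The point is only that a failing dependency degrades one feature instead of taking down services several hops away.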

Years later, I led a company-wide effort to optimize our entire user-facing application infrastructure to not fall over from sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was not a single common solution recommendation (like "tune your caches") we could give that would help all the teams because each one had very different resource usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into and some common issues to look for and some common solutions, get back to us if you need help." We spent a lot of time on the help.

I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB and it has a dependency (Actions) with extremely high variance in latencies and resource usage.

I wrote this in response to the below comment, which is now edited and unfortunately dead, so posting here:

I understand, that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!

Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.

It is really hard to predict how things will break until they do.

(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish those themselves. There were many people who researched their products and published their findings. For instance, someone analyzed wireshark traces and published a very detailed report about how they tuned their server-side networking stack. One thing that stood out was Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)

The root comment asked if I'd been part of an org scaling orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4 month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So with the credentialing question out of the way..

GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
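To make "shed load from misbehaving clients" concrete, here is a rough sketch of per-client load shedding with a token bucket. It is an illustration under my own assumptions, not anything from GitHub's reports: `TokenBucket`, `LoadShedder`, and the rate and burst values are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical per-client budget: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class LoadShedder:
    """Rejects requests from clients that exceed their own budget."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}

    def admit(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

A front end would typically answer a rejected request with something cheap like HTTP 429, so one noisy client burns its own budget rather than everyone's capacity.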

Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.

Your 100x rearchitecture story actually argues my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework.

So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.

I think you meant "green fields" and not "clean room"? Clean room refers to reverse engineering an existing program to create specifications, then having another team implement the specifications without legal risk from involving the original.
Yes I did, sorry! You are right. :)
Is GitHub scaling by orders of magnitude though? That would be an insane increase at this stage of their lifecycle.
They say it is at least one order of magnitude[1]; "our plan to increase GitHub’s capacity by 10X in October 2025 .. By February 2026, it was clear that we needed to design for a future that requires 30X today’s scale."

[1] https://github.blog/news-insights/company-news/an-update-on-...

Note the lack of concrete numbers on how much they have scaled. Somebody may have just asked an LLM for projections.
I wouldn't be surprised. Have you not noticed the sheer volume of slop being posted everywhere these days? Almost all of that is hosted on Github. And some of those repos have insane commit frequencies.
If they're suffering the onslaught of ai slop, it's possible.
> you have to redesign it from the first principles

And that starts by laying off your best engineers, I guess.

At that point, make it lazy indexing? Who cares that I can't find a repo that was made 10 seconds ago, or even 15 minutes ago? No seriously, who cares? Search to that level of nuance is not mission critical, I don't care what anyone says, you'll live if you wait another 15 minutes or even an hour. Their search has been terrible since their last major set of search changes where they overhauled it completely either way.
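What "lazy indexing" could look like, very roughly: buffer new documents and push them to the search index in batches at most every N seconds. This is a toy sketch under my own assumptions. `LazyIndexer` and `index_batch` are hypothetical names, and nothing here reflects how GitHub's search pipeline actually works.

```python
import time


class LazyIndexer:
    """Hypothetical sketch: batch index updates instead of indexing on every write."""

    def __init__(self, index_batch, interval=900.0, clock=time.monotonic):
        self.index_batch = index_batch  # callable that receives a list of docs
        self.interval = interval        # minimum seconds between flushes (15 min)
        self.clock = clock
        self.pending = []
        self.last_flush = clock()

    def add(self, doc):
        # Writes only touch the in-memory buffer; the index is updated later.
        self.pending.append(doc)
        self.maybe_flush()

    def maybe_flush(self):
        if self.pending and self.clock() - self.last_flush >= self.interval:
            batch, self.pending = self.pending, []
            self.index_batch(batch)
            self.last_flush = self.clock()
```

A real system would also flush on a timer rather than only when a new write arrives, but the trade is the same: search results lag by up to `interval` in exchange for far fewer index writes.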
> They're claiming a huge increase in traffic due to vibe coded projects. It might just be an excuse, but it certainly seems plausible to me.

I simply do not care.

Customers pay for a service. If they don't get what they paid for, it's perfectly reasonable and normal to go elsewhere.

Why do people on HN keep apologizing on behalf of trillion-dollar companies?

I mean, what will happen here is people will go to other services and they'll get overloaded too.

Self hosted is probably the way to go, but hardware prices are insane currently.

They can claim that...but if you've built a public SaaS before you know the job is not to host the software, it's to put rails around people taking it down. They've had since 2008 to build those rails, and they're just now hitting places that take the service down on the regular?
The problem is that they are charging per seat and need to start charging for usage.
Yep, definitely more traffic and also more new Github repos being created, with a pretty huge spike the last 2 months [1]

[1] https://bloomberry.com/data/github/

Probably true. GH Enterprise Cloud is mostly 100% uptime over the past 90 days.
I’d be shocked if this wasn’t the reason.
Two weeks ago I had a commission to explore migrating from selfhosted gitlab to github for better AI integration. Last night that project was cancelled due to github outages and we're going to upgrade the self hosted server instead. I'd be tempted to use something like forgejo but there are a dozen devs and honestly I've only ever used it solo.
Out of interest: what does "better AI integration" mean? Any specific functional or non-functional requirements?
I didn't challenge them on this but it's because of Claude integrations with github. I'm not sure what that gives them over just running it against the codebase, but I didn't want to lose the opportunity to finally move them from that EoL server
I would try to sell it internally. The interface is not that different, and I had good experiences myself stability-wise.
Azure Repos is kinda fine. It's really basic and there is nothing to break. I actually really really like their ticketing thingy for the same reason. It has the necessary stuff and the management types can't add a million fields to it and annoy me with reporting, burndown charts or whatnot.
Yea, I have Azure DevOps with free action minutes and I’ve started using it a ton more since it avoids all GH outages.
It has an annoying bug where approving PRs from the CLI won't delete branches when you squash commit, while clicking the button in the UI does it perfectly fine. It's been a bug for a while (as in several years), and if you find something like that, don't expect it to ever be fixed. As a whole it's not a bad tool though.

As you say it's limited, but that can be both good and bad.

You can cancel the migration, no need for sunk cost fallacy.
"You" can't necessarily do anything (you would be making a lot of assumptions about the influence this person has over the decision making process).

"Someone" can cancel the migration. "Someone" just won't.

i might be connecting unrelated dots, yet when i read "migration to Azure" this came back to me

https://news.ycombinator.com/item?id=47616242 https://isolveproblems.substack.com/p/how-microsoft-vaporize...

I'm on the other side of the fence. We're just about done migrating from GitHub to GitLab (self-hosted) and it's been refreshing to DGAF about any of the GH outages I read about.
Similar boat myself too, finished moving all important stuff from GitHub to self-hosted Forgejo with cross-platform builds. Not only do I avoid all the downtime stuff, but E2E builds also take ~20% of the completion time they used to take, since now my runners have dedicated hardware hosted at home.
To maintain a fair comparison, GitHub has supported self-hosted runners for several years (maybe that doesn’t work for your specific usage, for whatever reason).
> To maintain a fair comparison, GitHub has supported self-hosted runners for several years

Yeah, tried that first, as I didn't want to move to Forgejo, I just wanted to keep working when I wanted to work.

The GitHub runner on Linux seemed fine, but the ones for macOS and Windows seemingly did something that made them a hell of a lot slower than even running VMs and then executing stuff inside those. I'm not sure what the runner is doing, if there is some built-in sandboxing or whatnot for those platforms, but it wasn't feasible for me to rely on as the builds took way too long.

We were on self-hosted Gitlab but after a merger were forced to Github. Navigation feels painful in comparison and basic features such as commit graph are now behind more expensive tiers.
> We were on self-hosted Gitlab but after a merger were forced to Github. Navigation feels painful in comparison and basic features such as commit graph are now behind more expensive tiers.

Same experience here. Add to that that even on Enterprise tier:

- 1 Enterprise : 1 namespace - although you can segment it with Orgs, we were advised not to do it because we're too small (~2k people) (GL: groups, subgroups, sub-subgroups, ...)

- SSH deploy keys are singletons across the entire instance and repo-bound (and Weblate for instance can only use its own key), so you need a service account for that (GL: instance-wide SSH deploy keys that you can activate in specific repos)

- GHCR only really supports classic PATs for authentication ( https://docs.github.com/en/packages/working-with-a-github-pa... - GL: proper deploy keys properly inherited throughout the hierarchy)

So all in all the experience so far is a huge step-down. I really liked pinning commonly accessed pages in the sidebar.

Interesting! I worked with Gitlab and I also thought it was quite clunky. If it was not for the stability issues GitHub is fine. Any other alternatives to GH or GL?
Self-hosting with open source code:

- SourceHut: https://sr.ht/~sircmpwn/sourcehut/

- Forgejo (used by Codeberg, etc.): https://forgejo.org/

SourceHut never really clicked for me. It doesn't give me anything useful that I don't already have in a bare git repo over SSH.

Forgejo, on the other hand, is a drop-in replacement for GitHub.

Also:

- https://about.gitea.com/ (F/OSS, MIT-licensed, self-hosted GitHub-like instance)

We switched from Bitbucket to Gerrit internally and it was a steep learning curve for the devs but it's fine.

At a customer we're implementing GitHub Actions and even on our Dev environment there are so many hiccups with GitHub.

Gitea might be an option also.
Jira / Bitbucket / Teamcity.

Might be pricy though.

Having used Teamcity for CI I cannot think of a more clunky and hard to use system (compared to GHA, which is what we migrated to).
We ended up on Azure Pipelines kinda by default because it was there and mostly paid for, with the intention of later migrating, but it's been fine. Boring but stable and functional.
Me too. We just did a very similar migration at work and it's incredibly frustrating. I've got all my CI ported over and now this.

MSFT should just create slophub.com, they'd make money I'm sure.

Honest question, why are companies interested in hosting on github?

As a private person I use it too as a free host, but from work I mainly know self-hosted instances of Jenkins and TeamCity.

I think you’ve got it backwards. GitHub is by far the market leader for hosted repositories and maybe for CI too. This is like asking “Why are companies interested in using AWS?”

When one firm is so dominant for so long, the question is more like “Why shouldn’t we just use GitHub like 80% of software companies do?”

The issues they’ve had are almost all very recent. Very few companies have reevaluated that decision, because moving a big and well-integrated part of infrastructure is a huge project that delivers no value to the business. Speculating that you’ll have fewer development-slowing outages is not the most convincing when asking for the budget to do this. Plus, self-hosted isn’t necessarily going to have better uptime - mistakes happen.

I think before Actions, it would have been a lot easier to migrate off GH though. You’d just need to change a lot of repo URLs and find a way to set up webhooks from the new place to poke CI. Now with Actions, a lot lives in GH and in a proprietary flavor that doesn’t just ‘lift and shift.’

> I think you’ve got it backwards. GitHub is by far the market leader for hosted repositories

Maybe, but I never heard about any company using GitHub for internal projects in my real life. For me it was always the go-to for open source projects.

Then again it's not a topic that often comes up in my developer circles.

I think I worked at one company that used BitBucket instead of GitHub but GitHub has been the main place for the internal company repos everywhere else I’ve worked. GitHub is quite popular for any sort of git hosting.
More than half the companies I have worked for use Github. The others used Atlassian tools which were at least as bad from a reliability perspective and much less nice to use (IMO).
For me it’s been every company since 2010.
> The issues they’ve had are almost all very recent.

It has been bad for at least 18mo, maybe longer? I recall multiple work impacting outages at my previous employer extending back into 2024. Maybe even earlier than that?

In the lifespan of Git and GitHub that’s all very recent.
But on the timescale of tech it's a long time. When a service has serious availability issues for like 7 quarters straight that's.. I mean, it's why we're all talking about it.
> Honest question, why are companies interested in hosting on github?

Mostly boils down to marketing and it being easier to establish a community. Almost every developer has an account there, leading to network effects being much larger, so if you're a new FOSS project, finding contributors and getting your project in front of others' eyes is much easier when you're on GitHub compared to your own Forgejo instance.

With that said, I'd question if chasing "most external one-time contributors" or GitHub stars is the right way to actually run a FOSS project, personally I'd avoid thinking about those vanity-numbers as much as possible and focus on the project, code and contributors themselves.

But I've literally heard those two arguments for "why GitHub" countless times over the years.

Oh FOSS projects I totally understand. It's where I go to too.

But closed source companies surely don't need to establish a community?

Go with the flow, don't rock the boat and use what developers already know, are probably the most cited reasons I've heard.

I've tried so many times in the past to argue for self-hosted setup that you fully control if you can afford it, things just get so much smoother and if you're a software development company, you probably want to own the software development workflow E2E so you can actually ship as fast as you want.

I’ve argued the opposite most of the time in build vs buy. Buy in almost every case unless it’s a real competitive advantage to you.

I know developers love to build, but do you think:

1) self-hosting git provides any competitive edge to the business over letting someone manage it?

2) it provides so much value that you’re willing to fund engineers to build, secure, support this on an ongoing basis?

I’ve found the answer to those is No in both cases.

The same reason you wouldn’t build your own internal chat tool, you’d use Slack. And you wouldn’t bother self-hosting your own Jira or documentation.

Code hosting is code hosting, there’s no difference where it's hosted. There’s no slowdown in delivery with using GitHub - their March uptime was 99.5% which annoys some commenters but it’s fine. That’s 45 minutes downtime per month which is tolerable.
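Uptime percentages are easy to sanity-check, so here is a small sketch. The helper name `downtime_minutes` and the 31-day month are my own assumptions, not figures from GitHub.

```python
def downtime_minutes(uptime_percent, days=31):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# 99.5% over a 31-day month is about 223 minutes (roughly 3.7 hours).
print(round(downtime_minutes(99.5), 1))
# About 45 minutes per month corresponds to roughly 99.9% uptime.
print(round(downtime_minutes(99.9), 1))
```

On these assumptions, 99.5% works out to about 223 minutes a month, while a figure near 45 minutes matches roughly 99.9%.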

You would spend way more effort and money building a janky self-hosted solution to end up with a worse result.

Usually, at large enough corporations, it's one of two things. Some random project gets open sourced, and it ends up on GitHub (see, for example, Salesforce) - or, more commonly, some subsidiary or acquisition had GitHub and has either refused to migrate to the internal source system or the hassle of migration isn't worth it.
Most developers have experience using GitHub. The UI and concepts are familiar. The friction for adopting features like Actions is relatively low.
> The UI and concepts are familiar.

I guess, but it's not like you can't learn how to create a pull request on Bitbucket or how to create an issue on Jira within a work day?

That seems like the smallest thing when switching to a new company.

> The friction for adopting features like Actions is relatively low.

Yeah, I know almost nothing about the CI integration and actions when it comes to Github. Will look into it. Thank you.

At one point it was also used as signaling that a company was “modern.”
I don’t know why you would even really need hosted git or why you’d be affected by its downtime. Git is decentralized by design. One node going down should not stop development. You don’t need a “central hub” to keep working.

I guess it’s all the other non-git stuff like issue tracking and other (unfortunately) centralized products on GitHub that causes disruption when they go down.

Weird how GitHub built itself around a distributed VC system and then made all its other services centralized.

> I guess it’s all the other non-git stuff

Yes, you want to run automated builds, unit tests, end-to-end tests and UI tests, and make it easy for testers to deploy specific versions / tags to an internal server. Also kick off builds for iOS on Mac computers. We use TeamCity for that.

Tracking of issues, feature and epics. Maybe also knowledge base / wiki. We use Jira.

And pull requests. Bitbucket.

Onboarding contract workers is super easy.
If you’re making this change now, I wonder how the technical leadership evaluated GitHub and its competitors.. and then still landed on GH.

What made it better than e.g. GitLab?

> The wildest thing is that Azure Repos/Pipelines was better than this.

Wow; that's extremely damning because that service was garbage IME.

i did the circleci --> github actions migration for my job 1.5 years ago, and things seemed great... at first. at the time, we'd been dealing w/circleci's semi-regular (but thankfully short) outages for over two years, and we were excited to move to a more stable system.

now i'm considering deploying jenkins.

From CircleCI here. A big effort and investment went into resolving those outage issues you're referring to. Results have been stellar for a while now. Here's the latest: https://status.circleci.com
Apologies for shameless plug... have you considered Buildkite? Our statuspage is a sea of green even as we hit 1.3b minutes/wk (GHA sits at 2.1/wk now). Much less maintenance overhead than Jenkins, more dynamic featureset. The trial is all-access, unlocks the full product and can be extended past 30 days. Real human eng on standby throughout.
Out of interest: what VCS were you using with CircleCI? (as CircleCI is VCS agnostic)
Artifacts - C'mon Wit Da Git Down

https://www.youtube.com/watch?v=Js_Y_q-IkYo

Why do you care about GitHub? It’s just another corporation doing what they know best: harvesting money. The software ecosystem can live without GitHub just fine.
Github uptime down to 86% according to https://mrshu.github.io/github-statuses/ (not my website)