What's "the db"? It sounds like something of small to medium scale if you can just restart it like that.
In any case, why not just relocate some vendor engineers on site for a bit? Or, better, why does the vendor not have a small presence in the corner?
Sounds like whatever "the db" is it's probably some (objectively) small but very scary thing that's currently on fire and people are trying to figure out how to put it out without crashing the plane and also making too many waves internally, which is probably even harder. So asking about making vendor noises is (as useful as it may be) probably going down the wrong path - in much the same way this is probably not related to the outages (it may well be, but from the outside it's all coincidence anyway).
IIS Server had/has a memory leak in worker threads that, many years ago, forced us to restart the server every few days. Starting in 6.0, they added worker thread recycling and made it mandatory to choose a time period for every thread to be recycled. Why fix the error when you can just restart the service?
For old-school mod_perl apps setting MaxRequestsPerChild was often a much better ROI than actually finding and fixing the leaks.
Speaking as somebody who's done over a decade of large scale OO Perl applications and is actually really good at finding and fixing the leaks, this has often been intellectually aggravating, but every time I've set that option instead I've rewarded myself with a glass of bourbon for picking the pragmatic choice and then gone back to adding (non-leaky) features that were far more useful to the company in question than cleaning up the older code would've been.
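To make the trade-off concrete, here is a toy simulation of "recycle the worker after N requests", which is the idea behind MaxRequestsPerChild. It is only a sketch: `LeakyWorker`, `peak_leak` and the leak size are made-up names and numbers for illustration, not anything from Apache or IIS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyWorker:
    """Stand-in for a worker process that leaks a little on every request."""

    leak_per_request: int = 1  # hypothetical units of leaked memory
    leaked: int = 0
    handled: int = 0

    def handle(self) -> None:
        self.handled += 1
        self.leaked += self.leak_per_request


def peak_leak(total_requests: int, max_requests_per_worker: Optional[int]) -> int:
    """Serve total_requests, replacing the worker once it hits the request cap.

    Returns the largest amount of leaked memory any single worker reached.
    A cap of None means the worker is never recycled.
    """
    worker = LeakyWorker()
    peak = 0
    for _ in range(total_requests):
        worker.handle()
        peak = max(peak, worker.leaked)
        if (
            max_requests_per_worker is not None
            and worker.handled >= max_requests_per_worker
        ):
            # Recycle: the leaked memory goes away with the old worker.
            worker = LeakyWorker()
    return peak
```

Without a cap the leak grows with total traffic; with a cap it is bounded by the cap times the per-request leak, no matter how long the service runs. The leak itself is still there, it just never gets big enough to matter.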
For GitHub? It seems unbelievable that they would use IIS pre-purchase and why in the world would you mix in a second web server for post-purchase enhancements.
Why trade an open source solution for third rate garbage called IIS, which runs on a sub-par desktop OS called Windows? I thought that GitHub was supposed to be independent.
If GH is around the same level of integration with Microsoft as my employer, which is another Microsoft acquisition, I don't really believe you have a ton of insight into GH processes.
I dated a girl at GitHub for awhile last year who said they weren’t even completely off of AWS yet and she liked how it didn’t feel like working for Microsoft. Maybe this has changed though.
I'm not a DBA, and maybe you're not a DBA either, so this question goes to DBAs who may be reading: aren't you always better off killing the bad queries instead of rebooting the whole box, if that's an option? (ie: aside from times when the entire host is screwed, load per core is >50, metrics aren't getting out, you can't ssh in etc)
They could use multiple writer hosts and roll the restarts across them. MySQL has had GTIDs since 5.6 and Group Replication (rather than writer-replica setups) since some 5.7.x version.
It seems like we haven't had a non-robot update on the status page in days, even though this has become what seems like a daily occurrence. I figured at this point we'd get some explanation of why this is happening.
I also don't appreciate our builds freezing, unable to be cancelled and then eating up hundreds of minutes.
Billing should always be built on a "ping" IMO and not start/stop hooks. The latter is shockingly bad for customers during times of unreliability. The former sounds stupid and requires more infrastructure from the one offering the service, but I think it's more fair.
I haven't used GA in a way where it actually cost me anything, but having minutes just tick away while you can't do anything is really stupid if that's the case.
Edit: Another sane solution would probably be to record outage periods and have Billing automatically reconcile for every customer when invoicing. This would require them to admit the outage durations however, so it may be flawed from a human perspective.
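A rough sketch of the difference, in case it helps. Everything here is hypothetical (the function names, the 60-second interval, and the assumption that a hung runner stops sending pings); it is not how GitHub bills anything.

```python
from typing import Iterable


def billed_by_start_stop(start: float, stop: float) -> float:
    """Start/stop hooks: bill everything between the two events, in seconds."""
    return stop - start


def billed_by_pings(pings: Iterable[float], interval: float) -> float:
    """Ping billing: each received heartbeat pays for at most one interval.

    pings holds the timestamps (in seconds) at which the billing side
    actually received a heartbeat from the runner.
    """
    timestamps = sorted(pings)
    billed = 0.0
    for current, following in zip(timestamps, timestamps[1:]):
        # A long silence between two pings is only billed up to one interval.
        billed += min(following - current, interval)
    if timestamps:
        billed += interval  # the last ping covers one more interval
    return billed
```

Say a job pings at 0, 60 and 120 seconds, then hangs, and the stop hook only fires at 6000 seconds once things recover. `billed_by_start_stop(0, 6000)` charges the full 6000 seconds, while `billed_by_pings([0, 60, 120], 60)` charges 180. The cost is that the provider has to receive and store a ping per job per interval.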
The "ping" solution is an interesting one that I haven't seen proposed before.
At what rate would you do these pings? I don't know how upgrading/downgrading works at GitHub but if they do any sort of refund/credit when you downgrade, it seems like there's some interesting implications for abusing the system (e.g. upgrading/downgrading between pings for "free" service if the time between them is too long) versus performance (e.g. how do you update all users per ping in a timely manner if the time between them is too short?).
Would love to read up more on this approach; seems interesting!
I suggest you add the timeout-minutes property on the job/step, so even if the web interface isn't responsive the job times out eventually. Saves you from spending time emailing support about consumed minutes.
Of course, that assumes a future bug won't affect timeout-minutes itself.
This is totally unsurprising and also totally unacceptable IMO. They should automatically wipe out all build minute usage during outages for every account if they insist on architecting their system in this way.
Fossil is faultless for a team size of one. I've been using it for nearly a decade, doing totally non-optimal things like using versions released years apart on different OSs with the same database. I also ctrl-c it when I spot a typo in a commit message and check in binaries. Never missed a beat.
As headcount goes up I think the inability to locally rewrite history into easily reviewable patches would be sorely missed. So it's git for team stuff and fossil for my own.
It's NIMS - FEMA lingo, which predates ITIL. Which was developed in USFS wildland firefighting, which predates FEMA. It's incident management all the way down.
I've been trying to run GitHub Actions for a couple of hours now. They don't work at all. But apparently this means they run, just in infinite time, hence == degraded performance, nice.
We are scheduling a call with an enterprise sales person next week.
If I can get all the GitHub features I had as of ~2020, but on an instance that won't get hit by the public cloud/update bus, I would be exceptionally happy.
The only complaints we have are regarding availability. If we can fix that one problem, this is a perfect product in our view.
Github Enterprise hasn't been faring too well at my work either this week. When you work on both open and closed source products and GH and GHE are both down, it leads to a very unproductive week.
GitHub Enterprise is confusingly both a "call us for pricing" tier of GitHub the website, and also an on-premise version of GitHub that you can run as an appliance in your own data centre. The first of those is ultimately just GitHub and so has the same outages; the second runs on your own hardware, so it shouldn't be tied to the website's availability.
There are multiple products: self-hosted (Enterprise Server) and hosted by GitHub (Enterprise Cloud). I don't know about uptime guarantees, but you can buy Premium or Premium Plus support with 30-minute SLA or a dedicated account manager.
I’m not sure at what organization that is true. My company lives out of GitHub and Jira and I’ve hardly noticed the three month surge. GitHub would have to do a lot worse to get many companies to want to host their own services. This is the argument people have said about the cloud from day one.
People want to know it isn’t their problem, that makes cloud computing (and things like GitHub) worth their weight in gold. I have real problems to solve I don’t want to deal with a git repo manager on top of that.
Also, looking at this it seems like GitHub isn't doing the common SaaS thing of just lying on their status page. Many providers, both internal and external, would look a lot worse if they had honest status pages.
The status page stays green for a good 15 minutes from the first moment I see problems. It's not the first time; it actually happens quite often. Maybe that's the time they need to confirm, cross-check and write the status update, I don't know.
While quicker reporting would be better, 15 minutes is anecdotally a lot better than I see from most other services where their status pages will report all-clear hours into full outages.
They probably allow regular SREs to trigger an incident on the status page on their own, when the likes of AWS and other bigger cloud providers are rumored to need approval from a VP[0] to update the status page.
Several of the recent outages were much longer (at least for us, here in Asia) than they admitted on their status page. In one case I started work, noticed I couldn't push to or pull from GitHub, that situation persisted all day, and around 5pm local time (so morning-ish in the US) suddenly their status page acknowledged the problem and a discussion started on HN.
They do, intentionally or not, lie about this on their status page.
From December 25th to December 31st 2021, GitHub Actions had network problems almost every single day for hours, and the status page was green throughout that period.
The same thing also happened a few months back.
It feels like they do this manually and it's only done when enough people are affected.
This has been my experience as well. I don't know if that means GitHub is being overly transparent about issues or I've just been lucky but I would hate if people punished services for being transparent and informative on their status pages.
GitHub's outages have hit me hard over the past week or so. I don't think it's a matter of them being transparent--if anything, I was hitting errors well before their status page updated. Yesterday it was completely unusable for much of my workday, and today tasks that normally take me a few minutes have been taking hours.
> I’m not sure at what organization that is true. My company lives out of GitHub and Jira and I’ve hardly noticed the three month surge
These have been minor inconveniences for us - at worst. Most of the time it simply means people jump to something else then come back later in the day.
Failing tests and PR feedback cycles are more of a blocker to our team than these outages.
At my organization it's always been true. Setting up GitLab is fairly easy, in my company we do it and it's cheap (on-prem hosting is basically zero, and we had the IPs/domains already) and it hasn't given us too many headaches. I think last time I had to do something was maybe a few months ago when I restarted it so that it picked up the updated SSL certificate.
Self-hosting always increases the operational burden of making sure your systems are secure. Maybe you have the engineering resources to spend on patching everything immediately and conducting in-house pen tests, but for most companies it's much, much more secure to let the software's developers host it as well.
Not necessarily. Self-hosted services are protected by company firewall / VPN. They can setup very restrictive network access. They don't have the same level of risks as public services like GitHub or GitLab.
Except that the software developers hosting is also a much, much bigger target and you generally do not have any real control over how often they are patching either.
>> Setting up GitLab is fairly easy, in my company we do it and it's cheap (on-prem hosting is basically zero, and we had the IPs/domains already)
In what tech company is hosting or domains the main cost centre? Many companies spend more on a single hour of a dev's time than their entire GH monthly bill.
...What? $10 x 1000 = $10k / month. $10k x 12 = $120k. That is a new grad software engineer salary in any US city. You'd pay more than that for a single dev with the devops and security experience to keep GHE running and patched for 1000 devs.
I’d say it depends, I run my own on prem server and gitlab was a PITA. Too many moving parts, updating took too much of my time, and I never felt “safe”.
Moving to gitea solved all of those issues for me (thus far), now I’m looking into adding other stuff like CI through Drone.
It definitely depends. We’re pretty early stage and I’m the senior engineer+infrastructure guy so running our own gitea instance or whatever is just more time that I’m almost out of.
I’m on PST time, some of our other devs are on the east coast and one is in India. I think we’re spread out enough it should be an issue but maybe we prioritize different things.
I think the impact was for some reason not consistent between users (maybe due to geographical factors or maybe sharding of accounts?). We're in Asia and I think we've had three different days recently where we couldn't actually get much work done due to GitHub being flakey or down for the entire day and our CI/CD and development processes being built around it. We ended up moving off GitHub onto a self-hosted system, which took about a day of work for one engineer (CI/CD itself was already self-hosted, so just Git, issues and PRs), and there have already been two more GitHub outages since then.
I will say that for us this is a huge deal. We're a devops services company, and our customers expect their deployment pipelines to work. This is becoming a huge pain-point for a few of our customers and we recommended Github Actions to them. A couple of our customers want us to move away from GitHub actions because of how disruptive outages have been.
20 PRs waiting in line for half a day to be merged is pretty annoying. We’ve had that on multiple occasions the last few weeks due to GitHub incidents.
If you want companies to be honest on their status pages (I do!), you can't just count incidents like that. Status pages can be an amazing place to communicate all kinds of problems.
Most issues have a relatively narrow impact, but the impacted people _still_ benefit from seeing them listed.
Use vendors who do a good job communicating status, basically. I don't think you can change AWS behavior. But if you find a hosting company who does an amazing job with their status updates, put some apps there (_my_ company does an ok job with status page updates, we're getting better, it's not amazing yet).
The snarky answer is "literally all of them", but one real answer is that I've been pretty happy with GCP's status reporting for the past year-ish I've used them. I've only noticed a few incidents, but every time I've checked the status it was already updated. They also occasionally provide workarounds on the live incident pages if you need to be back up before the issue is fixed on their end.
> At this rate the benefits of running your own gitea or gitlab are starting to become competitive
When you host things yourself, you still have downtime. And, having worked with GitHub for over a decade, the actual disruption to my work from downtime is much less than if I had to host my own.
That being said: I briefly worked for a company that hosted its own source code control system. For us, as a small team, it wasn't worth it. The system was outdated and hosted in an insecure manner. No one ever did any "admin" work except the founder. He ran it because he had irrational fears of switching, not because of any tangible advantages over Github (and competitors.)
Keep in mind that Github (and competitors) are often cheaper than the time needed to invest in hosting your own. (Estimate 10-20 hours a year of invested time. Calculate your hourly rate. Github and competitors are cheaper.) In order to come ahead, you need tangible benefits other than "I think I can have less downtime."
Dunno, I got blocked from my work SaaS hosted gitlab for about a month by cloudflare. Nobody at gitlab or cf helped. I only figured it out myself after about 4 hours of research: it was caused by some web tracking APIs (disabled by me years ago) that no-one should have a hard dependence on.
I certainly would not have this problem on self hosted instance, because it would not be behind CF. I'm sure I'd have other problems though. :)
All software is crap. You can either spend time fixing it yourself, or spend time begging online for fixes/help from some SaaS company/community, with resolution times sometimes measured in months, all while you may not be able to use it fully.
Also, with SaaS it will be constantly shifting under you. Things will be moved around, restyled, iconized, popupized, etc. This doesn't help productivity either. With self-hosting, you can at least avoid upgrading if you dislike this kind of thing. Or choose FOSS software that values UX permanency/stability, which seems to be a really hard ask from a SaaS business.
I wondered if those error rates were proportional to Github's growth over time, so I looked it up. It seems that they have 40M users in 2019[1] and 73M users in 2021[2], which translates to 0.975 incidents per million users per year in 2019 compared to 1.178 in 2021.
So perhaps they are not exactly improving, but maybe there is some other way to normalize the data.
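For anyone who wants to redo the normalization with other figures, the comparison boils down to a relative change. The inputs below are just the numbers quoted above (users in millions and incidents per million users per year); `relative_change` is a helper name made up for this sketch.

```python
def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new, e.g. 0.25 means a 25% increase."""
    return (new - old) / old


# Figures quoted in the comment above.
users_2019_millions = 40
users_2021_millions = 73
rate_2019 = 0.975  # incidents per million users per year
rate_2021 = 1.178

user_growth = relative_change(users_2019_millions, users_2021_millions)
rate_growth = relative_change(rate_2019, rate_2021)

print(f"user growth:           {user_growth:.1%}")
print(f"per-user rate growth:  {rate_growth:.1%}")
```

On those inputs the user base grew by a bit over 80% while the per-user incident rate rose by roughly a fifth, so incidents grew faster than users did.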
One would have thought that once they got acquired by Microsoft the number of incidents would go down, considering all the resources Microsoft could provide, but no.
So, if my math is right (for 2021 only): 1888 min / 525,600 min = 0.36% downtime, i.e. 99.64% uptime.
If it was more like 99.80+ I think I would be like "meh", but honestly for the price you pay that's not terrible. Still, for a company at the Microsoft level, it should be 99.80 at least.
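The arithmetic, spelled out as a small sketch. It assumes a 365-day year and that the 1888 minutes are total downtime for 2021, as in the comment above; the function names are made up here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def uptime_percent(downtime_minutes: float,
                   period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Uptime as a percentage of the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def allowed_downtime_minutes(target_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget, in minutes, implied by an uptime target."""
    return period_minutes * (1 - target_percent / 100.0)


print(f"{uptime_percent(1888):.2f}% uptime")
print(f"{allowed_downtime_minutes(99.80):.0f} minutes allowed at 99.80%")
```

So 1888 minutes of downtime is about 99.64% uptime, and a 99.80% target would leave a budget of roughly 1,051 minutes (about 17.5 hours) per year.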
Only if you believe those numbers mean anything. What are the errors for? Github has been adding lots of features and subproducts over the years, becoming a bigger and bigger platform as a result. What you want is the error-per-component, which may very well have actually gone down, with error spikes coming from "when github adds a completely new feature and it goes through a slew of incidents in its first year". The bigger the feature, the more incidents.
Without more detailed numbers, there's literally no conclusion to draw here.
Every place I ever worked at understood that if you x3 the codebase/infra/interaction surface/etc, you can expect x3 errors. If the total number of errors doesn't go up as you grow you're doing amazing, and if it goes down even though you're landing more and more code for more and more features and subproducts, you have a genuine miracle.
But how many of those actually affected you? For example, no amount of issues around codespaces or github packages would impact my professional use of github, so whether there are 21 or 5000 or those parts get permanently taken offline makes no difference in what I need out of the platform.
How many core incidents? The part that affects whether you can even push to and pull from a repo, and access issues and PRs? Because everything else is nice to have, but you can do work perfectly fine without them if they go down for a few hours.
I was affected by the one last week, the one yesterday and the one today. The one today was harmless but the other two disrupted our work. All three were "core incidents", but the one today felt shorter.
Yesterday's affected me, I couldn't pull or push and when I tried to look at the repo to do PRs I got 500 errors. That only lasted maybe 30 minutes though.
We run Gitea at my company. In fact, we forked it. It could reeeaaaalllly use a rewrite. If anyone is even mildly ambitious about creating a new alternative to Github/Gitea, it's a great time to do that.
Another self-hosted project in the space that i've seen was GitBucket, although it runs on the JVM (not necessarily a bad thing, just different from Go): https://gitbucket.github.io/
And who pays for fixing it? Downtimes of self hosted systems using external software can be far longer. GitHub, unlike Amazon and friends, doesn't lie about their downtime. Every SaaS has hundreds of downtime instances across the board every month. Some are small enough that you don't see them. Yet the services still work exceptionally well, and when they don't they get fixed quickly. What takes them an hour would take most private orgs a day.
> GitHub, unlike Amazon and friends, doesn't lie about their downtime.
Are you kidding? The last 2 incidents were called "degraded performance". Where "degraded" meant I would get nothing but 500 errors accessing GitHub.com either via browser or git itself for the duration of the outage. How is this not lying?
Well, I think I have said that since 2020 [0] and it is self-evident that you are better off self-hosting your own Git repo. If you can host a website you can do it. If GNOME, ReactOS, Wireguard, Linux Kernel Project, Mozilla, etc can do it, so can you. Or even use it as a backup / failsafe just in case.
But going 'all in' on GitHub just doesn't make any sense anymore.
But who can host a website? I would be wary of hosting something that isn’t a 100% static site, out of fear of the amount of attention maintenance would take.
Also, quite a few of the non-profits behind the projects you mentioned have multi-million dollar budgets that they can use to administer their git instance, if needed. I don’t think “if they can do it, you can” is a strong argument for those.
I don't recall ReactOS, or the creators of wireguard having 'multi million dollar budgets'. How is it that even projects like RedoxOS [0] are able to self-host on a GitLab instance using a subdomain, without giant budgets in the millions?
You don't need a 'multi-million dollar budget' to self-host a git repo, and many of these open-source projects have been doing so for years, even before GitHub existed. Even if they did have such a budget, there isn't an excuse left not to self-host and avoid going 'all in' on GitHub.
At the very least I would expect something like what ReactOS is doing by having a self-hosted backup just in case GitHub goes down or vice-versa. [1]
> You don't need a 'multi-million dollar budget' to self-host a git repo
I never made that claim. The argument was “if X can do it, so can you”.
I pointed out that _some_of_these_ (Mozilla, likely the most extreme of them, had over $400 million in revenues in 2020), are quite different from the typical ‘you’, invalidating that argument.
As always, invalidating an argument doesn’t mean its conclusion is wrong.
So when are you going to question this user [0] and others here planning to do the same thing for not having a 'multi-million dollar budget' for self-hosting their own services then?
Since clearly according to you they 'can't do it', despite me saying 'if X can do it so can someone else'. Where 'X' can be even a toy project like RedoxOS, or a messenger project like GNU Ring hosted by themselves and accessed via a subdomain.
Seems like they and other lesser known and funded open-source projects are doing just fine like that for years.
My last bill from Hetzner was ~35€. I host gitea, drone CI, hashicorp vault and my own docker registry/pypi repository. I can add as many users as I want, and I had exactly zero incidents in the past ~6 years since I set this up.
I don't even worry about a strong backup strategy (besides just making occasional snapshots of the data volumes) because this was all set up with IaC tools (Terraform, Ansible) and I have copies of all the code in local repositories.
Can project management features not be made part of a dumb repo on the db side? (Spoiler: yes, and many projects have explored this; setup unfortunately has never been as easy as "we'll invite u to the gh, check ur email".)
Perhaps with decentralization push of web3/QR etc, we'll get over the hump.
I think that parent means also things like CI, release repository, PR review, etc.
These are not easily portable, but honestly it is because of this lock-in that I prefer to use separate/independent tools. For my open source project [0], I put things on GitHub and it is the link that I give to most people, but in reality it is just a mirror of the GitLab repository [1], which I use for CI and static page hosting, and the "project management" is done on Taiga [2].
The company I work for has a bunch of non-programmers using and working in gitlab (or "the git"), I can't really see it happening with GitHub regardless of where it was hosted.
Gitlab just seems better for actually running a software project.
Does Gitea support some kind of federation / cross-instance PRs? That's the main thing I'd miss from a self-hosted instance, the ease of getting contributions.
After all, you don't even need Gitea for pure Git hosting. If you have a server with SSH access, just init a bare repo in a directory, push to that, and you're ready to go. No web UI needed.
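For reference, the whole "bare repo plus push" flow fits in a few git commands. The sketch below drives them from Python so it can be run end to end; it needs `git` on the PATH, and it uses a local directory where a real setup would use an SSH URL. The names `demo_bare_remote` and `project.git`, and the `main` branch, are made up for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run a git command in cwd and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_bare_remote(base: str) -> list:
    """Create a bare repo, push one commit to it, return the subjects it holds."""
    root = Path(base)
    remote = root / "project.git"  # on a server this would sit behind SSH
    work = root / "work"
    remote.mkdir()
    work.mkdir()

    git("init", "--bare", cwd=remote)  # the "server" side: no working tree
    git("init", cwd=work)              # the developer's clone
    (work / "README").write_text("hello\n")
    git("add", "README", cwd=work)
    git(
        "-c", "user.name=Demo",
        "-c", "user.email=demo@example.invalid",
        "-c", "commit.gpgsign=false",
        "commit", "-m", "initial commit",
        cwd=work,
    )
    # With SSH access the URL would look like user@host:/path/to/project.git
    git("remote", "add", "origin", str(remote), cwd=work)
    git("push", "origin", "HEAD:refs/heads/main", cwd=work)

    return git("log", "--format=%s", "main", cwd=remote).splitlines()


if __name__ == "__main__" and shutil.which("git"):
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_bare_remote(tmp))
```

That really is all the hosting part needs. Everything else people rely on (PRs, issues, CI) is what sits on top.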
The reason I'm still using GitHub is not code hosting. It's collaboration.
Gitea gets you: a nice GitHub-like web GUI, including for stuff like managing users; 2FA; some integrations; web hooks without having to add git-hooks to all your repos; and extremely-useful-to-some-projects features like git-lfs support.
If you don't want or need those things, bare git repos are fine and certainly easier to support (not that Gitea's that hard, though a few issues/PRs I've noticed have caused me more than a little concern about the overall quality of the project).
But for new open source projects of mine the ease of contribution and user expectation of a github repository are a trade-off worth making even so (I also maintain a self hosted master git repo that I consider the source of truth to -me- but these days it syncs from, rather than to, github, just because of the trade-offs involved).
GitHub doesn't just host Git repositories. It's the central location for discussions, issues, code reviews, milestone planning, and any CI process like testing or releases. If it's unavailable whole teams can be interrupted.
It depends on how many engineers you have! But also, there are plenty of other functions in GH besides raw git, like Wiki/PR/Issues/test/deploy pipelines, etc. It can become pretty critical.
> At this rate the benefits of running your own gitea or gitlab are starting to become competitive
No need, just use Codeberg.org instead. They run Gitea, and it is a free collaboration platform (+ git hosting) for free projects. FOSS/OSS should really consider alternatives to GitHub and GitLab, especially when there are much more FOSS/OSS friendly platforms around.
I've actually been pretty impressed with the quality of the product and new features over the past couple of years, but it seems to be having a lot of stability issues recently.
Again? Last time that happened was 24 hours ago? [0] It is really getting unreliably bad. Like I said before, having a self-hosted backup seems to make more sense.
Looking at the "GitHub" prefix in the title, I was half-expecting this to point to a report explaining the outage a week ago... But rest assured, it is a new outage!
Work chose GitHub (we are a Microsoft shop), and I have to say, I like GitHub a lot. The disruptions have been annoying sometimes, that's true. But due to the nature of Git I could always just keep working.
And to think Git can easily be decentralized. I wonder if the community could fork GitHub to fix it. Oh, it's not open source. Devs must be too busy working on more 'social' features like "For You (Beta)" to milk the attention economy.