Hacker News
GitHub Actions Down
89 points by kakamiokatsu 1862 days ago
And I'm here waiting for my scheduled build to start...
9 comments

So now I am expecting GitHub Actions (worst case, the whole of GitHub) to have a major incident at least once every single month. The last time it went down was last month. [0]

I now doubt that they can consistently manage more than a month without a major incident like this one.

[0] https://news.ycombinator.com/item?id=26666843

Isn't this normal with hosted solutions?

IIRC, GitHub, Bitbucket and GitLab.com are all unusable for a few hours at least once a month as far back as I can remember.

Isn't this just commonly accepted as a tradeoff for not having to manage the servers yourself?

I would say it's normal for businesses which are engaged in competitive feature development.

Stability & reliability are relatively easy to achieve if you aren't changing the software frequently.

I'm pretty sure they're constantly working on their software (at least GitLab is), so "aren't changing anything" probably does not apply.
That was my point. If you want stability, look for someone who is no longer funding a substantial development team to work on the software.
This is a profound insight.
Nope, this isn't normal at all for products on AWS, GCP, or pretty much any other cloud provider. Azure is simply a subpar product and it's time to stop attempting to sweep its awful downtime under the rug.
It could be interesting to track the uptime of such cloud services. Two decades ago, companies prided themselves on 4 or 5 nines (99.99% or 99.999% uptime). Not anymore. Worse is better won yet again ;-)

Companies and programmers should be aware what they get into when they build such dependencies. Distributed git is a thing, but distributed CI/CD that you could also run locally isn't (yet?).
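
For a sense of scale, here is a rough sketch of how much downtime a given number of nines actually allows. It assumes a 30-day month and a 365-day year, and the function and constant names are my own; real SLAs define their own measurement windows and exclusions.

```python
# Sketch: how much downtime a given availability target allows.
# Assumption: 30-day month and 365-day year; real SLAs define their own windows.

HOURS_PER_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the hours of downtime permitted in a period at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        per_month_min = allowed_downtime_hours(target, HOURS_PER_MONTH) * 60
        per_year_min = allowed_downtime_hours(target, HOURS_PER_YEAR) * 60
        print(f"{target}%: {per_month_min:.1f} min/month, {per_year_min:.1f} min/year")
```

Under these assumptions four nines leaves under an hour of downtime per year, so an outage lasting several hours uses up that budget many times over.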

The fact that it’s down is not the problem, but it’s painful that there’s nothing I can do to fix it.

That’s the part I like about self hosting. It may ultimately be down more often, but I never have to tell someone “Nothing to be done, we wait.”.

Which is better for you, but for your company, "wait, I'm on it" and "wait, they're on it" does not make a world of difference.

What's better, on the other hand, is that you can schedule expected and possible downtimes to a time that causes the least impact to your company; with a SaaS, an update might cause you problems any time.

Your second point exactly nails it. Hosted solutions break because they're busy pushing new features I may or may not care about. When self hosting I can decide when upgrading is worthwhile to my needs and then plan when to make risky actions according to my own organization's timeline. Any single org can probably get away with 90% uptime just so long as the downtime is at the correct time.
While I sort of agree, I guess that’s arguable. My bosses really like telling theirs that we are doing something about it too.

In these kinds of incidents, you often end up being told ‘the problem is resolving itself in region x’, where region x is not relevant to you at all. If you are fixing your own setup you can focus on exactly what is most important (to you) first.

Easy to explain. They switched to the MS Azure cloud for actions.

From Microsoft you won't get the high availability you are used to from proper cloud services. Plus there are privacy issues. But it's cheap, in this case free.

I was amused by reports when Microsoft was in talks with Discord that one of the reasons why Microsoft wanted to buy Discord was because they wanted Discord on Azure. Like, was the grand customer acquisition strategy for Azure just acquiring the companies and then migrating over?
My wife works as an AWS/Azure consultant, and she mentions that in our area it's much more common for the non-technical management to push Azure than it is for technology to choose it. Sounds quite IBM-ish/Oracle-ish.
When you open a Microsoft account for a new company, they do a lookup to figure out your area of activity and if you're a good match you get contacted by a sales representative asking you if you want to become an Azure re-seller. Basically for services you sell to third parties you get Azure credits meaning your Azure usage is "free", and your clients pay the premium. I know this because I used to work for a company that did this, and from personal experience when starting a company.

Edit: Here is a tip: if you see a "Microsoft partner <TIER> Cloud Platform" badge on an outsourcer's website, stay away.

I've had a couple of Fortune 500 companies as clients. Microsoft/Azure is usually brought in as a place that is treated like VMs in the cloud. The setup and management is often handled by Accenture and Infosys. Impossible that that decision was made by Engineering. In fact those Accenture-managed setups are almost unusable for engineers. I can't even begin to fathom how much these companies spend on Accenture to set up Azure in a fashion where you can't do anything.

The worst part about Azure for me is always the list of undocumented bugs you run into. On the surface it looks like everything started as an AWS equivalent, but when you have to drill down on something it almost always has some weird issues that you then find as unresolved complaints on some MS managed github issue list.

But hey, maybe I was just luckier with the other cloud providers.

I worked on a project automating some parts of an Azure infrastructure for a big company. Half-way through development, JSON integers returned by Azure changed from strings to ints, back to strings. E.g., "42" became 42, then a few weeks later went back to "42".

This and other API weirdness gave me such Azure PTSD that I promised myself I would never touch it again.
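
One defensive pattern for this kind of flip-flopping is to coerce the field on the client side so either representation is accepted. A minimal sketch, assuming the value is always either a JSON integer or a numeric string; `as_int` and the `count` field are made-up names, not anything from a real Azure response.

```python
import json


def as_int(value) -> int:
    """Coerce a JSON value that may arrive as 42 or "42" into an int."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError(f"expected an integer, got bool: {value!r}")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value.strip())  # raises ValueError on non-numeric text
    raise TypeError(f"expected an integer or numeric string, got {type(value).__name__}")


if __name__ == "__main__":
    # "count" is a hypothetical field name used only for illustration.
    for payload in ('{"count": "42"}', '{"count": 42}'):
        print(as_int(json.loads(payload)["count"]))
```

It doesn't make the API any less weird, but it keeps the automation from breaking each time the type changes.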

That's exactly the situation I'm in at my current work environment now. All Azure, and everyone in Engineering/Devops hates it. It's a business decision though.
Sounds like MS; that is the only way they can get organic customers. All their popular products were originally built by someone else, except Windows of course.
Buying products though is very different than buying customers; if we try to map this hilarious customer "acquisition"--which is now a double entendre ;P--strategy to a more typical product, it would be akin to saying "no one is using Excel, so let's start buying large accounting firms currently using VisiCalc to migrate over".
Azure, .net, office, office 365, teams, ml.net, ...

Not a single one I could manage to build (e.g. Teams has bots, quick-to-create apps and pretty advanced cam features).

Teams UX is a hot mess; it's just astonishingly bad.
Hopefully this serves as a good dogfooding exercise for MS now and helps them improve things.

Every time I tried azure I was disappointed. But that doesn't mean they can't fix it; I bet there are now tons of talented engineers working there. My best wishes for them to up the quality of azure. I think diversity / alternatives are a good thing.

Down ~4~ 9 hours so far today.
In the last week, there have been at least two ~1 hour periods where actions were stuck in the queue. Even posted about this on the GitHub community, but no response. [0]

As unreliable as GitHub Actions are, their convenience factor and price are right.

We take a very simple measure so we don't get fucked by these kinds of incidents: we don't use any actions from the marketplace.

All our GitHub Actions workflows are bash scripts that we wrote (and which often live in our repos at `deploy/deploy.bash`). The secrets necessary to run these scripts are available to the infrastructure team on 1Password.

This makes it easy for us to deploy manually and retroactively reflect that release on GitHub (e.g. through a tag or a release).

[0] https://github.community/t/github-action-stuck-on-starting-w...

This is an interesting approach. I also dislike the current architecture they are pushing. Actions can break at any time. That makes pipelines/actions very fragile and, as you pointed out, not portable either.

The portability could be fixed by having a local CLI runner that understands action YAMLs. Would be interesting to explore this.

For a while I have been fantasizing about a universal pipeline language. Like having LLVM with a unified model that can translate into different vendor implementations.

> The portability could be fixed by having a local CLI runner that understands action YAMLs. Would be interesting to explore this.

This is a great idea. I guess the challenge would be keeping up with the more advanced aspects of Actions, e.g. spinning up multiple VMs during the build process and other things that have a heavy infrastructure (or platform-specific) element.

I'm currently caught by the GitHub Actions downtime, but like the GP, all my build scripts generally make very light use of the platform-specific features and generally I keep most of the build logic in a build.sh file.

So I can build production builds locally if I need to - but a CLI that lets you "properly" run the GitHub yml files locally would be very interesting.

My workflows have been queued for at least four hours. It's a hotfix commit. GitHub Actions has helped me a lot in synchronizing updates with other repositories automatically, but this time I have to apply it manually. Because this is the first time I've experienced a GitHub Actions incident, I was wondering whether I had run out of my free quota for this month...

By the way, should GitHub send an email to the owner if any workflow has been delayed for an unreasonable time?

Personally I'd rather they focus on the issue rather than negotiating with PR and legal to formulate an e-mail. If you rely on an external service and don't monitor it that's on you.
What a trite response.

The question was: do they send an email if your job is delayed or late for whatever reason?

Not "hey, why don't they email us all right now about the issue?"

And no, it's not on me to monitor every little thing I rely on. Do you monitor kernel updates? I bet you don't. Besides that monitoring and logging for any provided service is exactly how one is supposed to monitor said external services so asking about monitoring options and being told, look buddy it's your job to monitor for this is just fucking rude.

Well, I do, but yes, no one can monitor everything. The question was whether they should have sent an e-mail, and I shared my preference as to their prioritization of resources. And yes, if something I wasn't monitoring breaks, I still assume responsibility, most especially if it affects production. And no, it wasn't rude; you seem very sensitive.
Have you tried monitoring GitHub Actions? It’s not uncommon for me to find that actions just don’t run for some reason. The docs are so incomplete that it’s hard for me to know why.
I can't speak for GitHub Actions, but for Azure DevOps, yes: I have an HTTP endpoint that I expect jobs to hit with their outcome, and if they don't, I get bugged. Anything is going to break; for external stuff this is the only way to estimate the cost/benefit of a contingency.
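
The "expect a check-in, alert when it's missing" idea can be sketched as a small dead-man's-switch check. This is only an illustration under my own assumptions: the job names, the 10-minute grace period and the date are arbitrary, and how check-ins get recorded (the HTTP endpoint) is left out entirely.

```python
from datetime import datetime, timedelta, timezone


def overdue_jobs(last_seen, expected_interval, now, grace=timedelta(minutes=10)):
    """Return names of jobs whose last check-in is older than interval + grace.

    last_seen: dict mapping job name -> datetime of the last check-in
    expected_interval: dict mapping job name -> timedelta between expected runs
    """
    late = []
    for name, interval in expected_interval.items():
        seen = last_seen.get(name)
        # A job that never checked in is treated as overdue.
        if seen is None or now - seen > interval + grace:
            late.append(name)
    return sorted(late)


if __name__ == "__main__":
    # Arbitrary example timestamp and hypothetical job names.
    now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_seen = {
        "hourly-build": now - timedelta(hours=4),    # has missed several runs
        "nightly-deploy": now - timedelta(hours=3),  # still within its window
    }
    expected = {
        "hourly-build": timedelta(hours=1),
        "nightly-deploy": timedelta(hours=24),
    }
    print(overdue_jobs(last_seen, expected, now))
```

The point of checking for silence rather than for failures is that a job stuck in a queue never reports anything, so only the missing check-in gives it away.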
We build an Electron app for macOS regularly. We run these builds on GitHub’s macOS VMs.

The internet fails to connect when running yarn install about 3 out of every 5 times.

We’ve gotten multiple refunds for this issue, but it’s still a complete mess. Unfortunately we’ve built much of our process around GH Actions... if not I’m sure we would be on CircleCI or TravisCI already. We’ve also considered switching to self-hosted runners.

We’re running everything on self-hosted runners and it works like a charm. They are also way, way faster than what GitHub can provide.

We get the advantages of the huge GitHub Actions ecosystem while having impressive (and fully controllable) performance and very easy access to our infrastructure for deployments.

What do you use to run the self-hosted runners?
In our case, we use a Mac mini solely for building mobile applications via fastlane with a self-hosted runner. Doing that with the GitHub runner on Mac would be mega expensive. The rest of the CI/test process runs on GitHub runners.
Does that work with M1/Apple Silicon Macs or Intel-only?
GitHub right now only has their self-hosted Mac runner released as an Intel Build, but it works fine through Rosetta, though running Apple Silicon Xcode might require some additional wrapping of commands with `arch` which might or might not be built into the pre-built action you're using.

We're currently using an Intel Mac we've used from before we migrated from Jenkins to GitHub actions, so your mileage may vary.

We're using https://next.yarnpkg.com/features/zero-installs, and it gets rid of one off issues like this (unless you have native deps which require an install).
Thank you for the link, looks interesting! I wish this were available in Yarn V1 :/
FWIW, GitHub Actions macs are MacStadium macs.
We only use GH Actions for check builds right now, which are easily disabled as a temporary measure.

Having the ability to build & deploy your software outside the confines of a cloud vendor is essential to survival. When the automation works, it's great. When it doesn't, have a manual process that you can follow on a local workstation.

At the end of the day, you can always email the customer a zip file and walk them through installing the update in production. That is, as long as you didn't make your architecture and CI/CD one and the same thing, in which case you probably need to hit the reset button and try again.

Great advice. How is this managed where you are? Do you share the same code between CI and the local workstation?
Everything required to build our application lives in a single code base & Visual Studio solution. The application is capable of building itself from source.

If you know how to do things like Process.Start, there's really no excuse for not being able to automate your build processes using code. MSBuild has a pretty damn simple set of CLI args if you are just doing modern .NET 3.x/5.x apps.

  git clone <my repo path>
  cd <my repo path>
  dotnet build --configuration Release
  # copy build artifacts to wherever they need to go
That's about it for us.

We use SQLite, so there aren't any dependencies outside of any particular checkout of the repo.
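
If you want CI and a workstation to share one entry point, the same idea can be sketched as a tiny script that both of them call. This is a hedged illustration in Python rather than our actual setup: `run_steps` and `BUILD_STEPS` are made-up names, and only the `dotnet build` command comes from the steps above.

```python
import subprocess
import sys


def run_steps(steps, cwd=None):
    """Run each command in order; stop at the first failure and return its exit code."""
    for cmd in steps:
        # Echo the command to stderr so build output on stdout stays clean.
        print("+", " ".join(cmd), file=sys.stderr, flush=True)
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            return result.returncode
    return 0


# Build steps as argument lists. Hypothetical list: extend it with your own steps.
BUILD_STEPS = [
    ["dotnet", "build", "--configuration", "Release"],
]
```

Calling `sys.exit(run_steps(BUILD_STEPS))` from the CI job and from a local shell means the hosted pipeline is just one more caller of the script, not the only place the build logic lives.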

Tip: you can run and host your own GitHub Actions runners.

https://docs.github.com/en/actions/hosting-your-own-runners/...

Since I have a job that runs every hour, I can see that it's been down for ~4 hours.