So now I expect a major GitHub Actions incident (worst case, the whole of GitHub going down) at least once every single month. The last time this went down was last month. [0]
I now doubt if they can consistently manage more than a month without a major incident like this one.
Nope, this isn't normal at all for products on AWS, GCP, or pretty much any other cloud provider. Azure is simply a subpar product and it's time to stop attempting to sweep its awful downtime under the rug.
It could be interesting to track the uptime of such cloud services. Two decades ago, companies prided themselves on 4 or 5 nines (99.99% or 99.999% uptime). Not anymore. Worse is better won yet again ;-)
Companies and programmers should be aware of what they are getting into when they build such dependencies. Distributed git is a thing, but distributed CI/CD that you could also run locally isn't (yet?).
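To put rough numbers on the nines, here is a back-of-the-envelope sketch. It assumes a 365-day year and treats all downtime as equal; the function name is made up for illustration.

```python
# Back-of-the-envelope: how much downtime does an availability target allow?
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignores leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of downtime permitted in the period at this availability."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Four nines works out to roughly 52.6 minutes a year and five nines to about 5.3 minutes, so a single multi-hour incident blows through either budget.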
Which is better for you, but for your company, "wait, I'm on it" and "wait, they're on it" do not make a world of difference.
What's better, on the other hand, is that you can schedule expected and possible downtimes to a time that causes the least impact to your company; with a SaaS, an update might cause you problems any time.
Your second point exactly nails it. Hosted solutions break because they're busy pushing new features I may or may not care about. When self-hosting, I can decide when upgrading is worthwhile for my needs and then plan risky actions according to my own organization's timeline. Any single org can probably get away with 90% uptime so long as the downtime happens at the right time.
While I sort of agree, I guess that's arguable. My bosses really like telling theirs that we are doing something about it too.
In these kinds of situations, you often end up being told 'the problem is resolving itself in region x', where region x is not relevant to you at all. If you are fixing your own setup, you can focus on exactly what is most important (to you) first.
Easy to explain. They switched to the MS Azure cloud for actions.
You won't get the high availability from Microsoft that you are used to from proper cloud services. Plus privacy issues. But it's cheap, in this case free.
I was amused by reports when Microsoft was in talks with Discord that one of the reasons why Microsoft wanted to buy Discord was because they wanted Discord on Azure. Like, was the grand customer acquisition strategy for Azure just acquiring the companies and then migrating over?
My wife works as an AWS/Azure consultant, and she mentions that in our area it's much more common for the non-technical management to push Azure than it is for technology to choose it. Sounds quite IBM-ish/Oracle-ish.
When you open a Microsoft account for a new company, they do a lookup to figure out your area of activity and if you're a good match you get contacted by a sales representative asking you if you want to become an Azure re-seller. Basically for services you sell to third parties you get Azure credits meaning your Azure usage is "free", and your clients pay the premium. I know this because I used to work for a company that did this, and from personal experience when starting a company.
Edit: Here is a tip: if you see a "Microsoft partner <TIER> Cloud Platform" badge on an outsourcer's website, stay away.
I've had a couple of Fortune 500 companies as clients. Microsoft/Azure is usually brought in as a place that is treated like VMs in the cloud. The setup and management is often handled by Accenture and Infosys. Impossible that that decision was made by Engineering. In fact, those Accenture-managed setups are almost unusable for engineers. I can't even begin to fathom how much these companies spend on Accenture to set up Azure in a fashion where you can't do anything.
The worst part about Azure for me is always the list of undocumented bugs you run into. On the surface it looks like everything started as an AWS equivalent, but when you have to drill down on something it almost always has some weird issues that you then find as unresolved complaints on some MS-managed GitHub issue list.
But hey, maybe I was just luckier with the other cloud providers.
I worked on a project automating some parts of an Azure infrastructure for a big company. Half-way through development, JSON integers returned by Azure changed from strings to ints, back to strings. E.g., "42" became 42, then a few weeks later went back to "42".
This and other API weirdness gave me such Azure PTSD that I promised myself I would never touch it again.
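For what it's worth, one defensive way to survive that kind of flip-flopping is to normalize the value at the boundary. This is only a sketch: the `count` field and the payloads are made up, and `as_int` is a hypothetical helper, not anything from an Azure SDK.

```python
import json


def as_int(value):
    """Accept 42 or "42" and return 42; reject everything else."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so rule it out explicitly
        raise TypeError("expected an integer, got a boolean")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value)  # raises ValueError for non-numeric strings
    raise TypeError(f"expected int or numeric string, got {type(value).__name__}")


if __name__ == "__main__":
    # Hypothetical payloads showing the same field in both encodings
    string_style = json.loads('{"count": "42"}')
    number_style = json.loads('{"count": 42}')
    print(as_int(string_style["count"]), as_int(number_style["count"]))
```

It doesn't make the API any less weird, but the rest of the automation only ever sees an int.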
That's exactly the situation I'm in at my current job. All Azure, and everyone in Engineering/DevOps hates it. It's a business decision though.
Sounds like MS; that is the only way they can get organic customers. All their popular products were originally built by someone else, except Windows of course.
Buying products though is very different than buying customers; if we try to map this hilarious customer "acquisition"--which is now a double entendre ;P--strategy to a more typical product, it would be akin to saying "no one is using Excel, so let's start buying large accounting firms currently using VisiCalc to migrate over".
Hopefully this serves as a good dogfooding exercise for MS now and helps them improve things.
Every time I tried Azure I was disappointed. But that doesn't mean they can't fix it; I bet there are now tons of talented engineers working there. My best wishes for them to up the quality of Azure. I think diversity / alternatives are a good thing.
In the last week, there have been at least two ~1 hour periods where actions were stuck in the queue. Even posted about this on the GitHub community, but no response. [0]
As unreliable as GitHub Actions are, their convenience factor and price are right.
We take a very simple measure so we don't get fucked by these kinds of incidents: we don't use any actions from the marketplace.
All our GitHub Actions workflows are bash scripts that we wrote (and which often live in our repos at `deploy/deploy.bash`). The secrets necessary to run these scripts are available to the infrastructure team on 1Password.
This makes it easy for us to deploy manually and retroactively reflect that release on GitHub (e.g. through a tag or a release).
This is an interesting approach. I also dislike the current architecture they are pushing. Actions can break at any time, which makes pipelines/actions very fragile and, as you pointed out, not portable either.
The portability could be fixed by having a local CLI runner that understands action YAMLs. Would be interesting to explore this.
For a while I have been fantasizing about a universal pipeline language. Like having LLVM with a unified model that can translate into different vendor implementations.
> The portability could be fixed by having a local CLI runner that understands action YAMLs. Would be interesting to explore this.
This is a great idea. I guess the challenge would be keeping up with the more advanced aspects of Actions, e.g. spinning up multiple VMs during the build process and other things that have a heavy infrastructure (or platform-specific) element.
I'm currently caught by the GitHub Actions downtime, but like the GP, all my build scripts generally make very light use of the platform-specific features and generally I keep most of the build logic in a build.sh file.
So I can build production builds locally if I need to - but a CLI that lets you "properly" run the GitHub yml files locally would be very interesting.
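To make the idea concrete, here is a toy sketch of what the simplest possible local runner could look like. It assumes the workflow file has already been parsed into a dict (for example with a YAML library such as PyYAML), runs only the shell `run:` steps of one job, and skips `uses:` steps entirely. The function name is hypothetical, and matrices, services, secrets and expressions are all ignored, which is exactly where the hard part lies.

```python
import subprocess


def run_workflow_locally(workflow, job_name):
    """Run the `run:` steps of one job in order; return labels of skipped steps."""
    skipped = []
    for index, step in enumerate(workflow["jobs"][job_name]["steps"]):
        label = step.get("name") or step.get("uses") or f"step {index}"
        if "run" not in step:
            # Marketplace actions (`uses:`) need the hosted machinery, so skip them
            skipped.append(label)
            continue
        # Assumption: the system default shell is close enough to the runner's shell
        subprocess.run(step["run"], shell=True, check=True)
    return skipped


if __name__ == "__main__":
    # Hand-written stand-in for a parsed workflow file
    workflow = {
        "jobs": {
            "build": {
                "steps": [
                    {"uses": "actions/checkout@v2"},
                    {"name": "build", "run": "echo building"},
                ]
            }
        }
    }
    print("skipped:", run_workflow_locally(workflow, "build"))
```

The less a workflow leans on `uses:` steps, the more of it a runner like this could actually execute, which is the same argument as keeping the logic in build.sh.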
My workflows have been queued for at least four hours. It's a hotfix commit. GitHub Actions has helped me a lot in synchronizing updates with other repositories automatically, but this time I have to apply it manually. Because this is the first time I've experienced a GitHub Actions incident, I was wondering whether I had run out of my free quota for this month...
By the way, should GitHub send an email to the owner if any workflow has been delayed for an unreasonable time?
Personally I'd rather they focus on the issue rather than negotiating with PR and legal to formulate an e-mail. If you rely on an external service and don't monitor it that's on you.
The question was: do they send an email if your job is delayed or late for whatever reason?
Not "hey, why don't they email us all right now about the issue?"
And no, it's not on me to monitor every little thing I rely on. Do you monitor kernel updates? I bet you don't. Besides, the monitoring and logging a service provides is exactly how one is supposed to monitor said external service, so asking about monitoring options and being told "look buddy, it's your job to monitor this" is just fucking rude.
Well, I do, but yes, no one can monitor everything. The question was whether they should have sent an e-mail, and I shared my preference as to their prioritization of resources. And yes, if something I wasn't monitoring breaks, I still assume responsibility, most especially if it affects production. And no, it wasn't rude; you seem very sensitive.
Have you tried monitoring GitHub Actions? It’s not uncommon for me to find that actions just don’t run for some reason. The docs are so incomplete that it’s hard for me to know why.
I can't speak for GitHub Actions, but for Azure DevOps yes: I have an HTTP endpoint that I want pipelines to hit with their outcome, and if they don't, I get bugged. Anything is going to break; for external stuff this is the only way to estimate the cost/benefit of a contingency.
We build an Electron app for macOS regularly. We run these builds on GitHub's macOS VMs.
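Roughly, the bookkeeping behind that kind of check-in endpoint could look like the sketch below. It leaves out the HTTP layer and the alerting, and the class, job names and outcome strings are all invented for illustration.

```python
import time


class HeartbeatMonitor:
    """Track when each job last reported in and flag the ones that look wrong."""

    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen = {}  # job name -> (timestamp, outcome)

    def record(self, job, outcome, now=None):
        """Call this from whatever handles the pipeline's check-in request."""
        timestamp = time.time() if now is None else now
        self.last_seen[job] = (timestamp, outcome)

    def problems(self, expected_jobs, now=None):
        """Return (job, reason) pairs for jobs that failed, never ran, or went quiet."""
        now = time.time() if now is None else now
        issues = []
        for job in expected_jobs:
            seen = self.last_seen.get(job)
            if seen is None:
                issues.append((job, "never reported"))
            elif seen[1] != "success":
                issues.append((job, f"reported {seen[1]}"))
            elif now - seen[0] > self.max_silence_seconds:
                issues.append((job, "overdue"))
        return issues


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence_seconds=3600)
    monitor.record("nightly-build", "success")
    print(monitor.problems(["nightly-build", "deploy"]))
```

The useful property is that silence counts as a failure, so a job stuck in a queue gets noticed even though it never reports anything.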
The internet fails to connect when running yarn install about 3 out of every 5 times.
We’ve gotten multiple refunds for this issue, but it’s still a complete mess. Unfortunately we’ve built much of our process around GH Actions... otherwise I’m sure we would be on CircleCI or TravisCI already. We’ve also considered switching to self-hosted runners.
We’re running everything on self-hosted runners and it works like a charm. They are also way, way faster than what GitHub can provide.
We get the advantages of the huge GitHub Actions ecosystem while having impressive (and fully controllable) performance and very easy access to our infrastructure for deployments.
In our case, we use a Mac mini solely for building mobile applications via fastlane with a self-hosted runner. Doing that with the GitHub runner on Mac would be mega expensive. The rest of the CI/test process runs on GitHub runners.
GitHub has so far only released their self-hosted Mac runner as an Intel build, but it works fine through Rosetta. Running Apple Silicon Xcode might require wrapping some commands with `arch`, which might or might not be built into the pre-built action you're using.
We're currently using an Intel Mac that we've had since before we migrated from Jenkins to GitHub Actions, so your mileage may vary.
We only use GH Actions for check builds right now, which are easily disabled as a temporary measure.
Having the ability to build & deploy your software outside the confines of a cloud vendor is essential to survival. When the automation works, it's great. When it doesn't, have a manual process that you can follow on a local workstation.
At the end of the day, you can always email the customer a zip file and walk them through installing the update in production. That is, as long as you didn't make your architecture and CI/CD one and the same thing, in which case you probably need to hit the reset button and try again.
Everything required to build our application lives in a single code base & Visual Studio solution. The application is capable of building itself from source.
If you know how to do things like Process.Start, there's really no excuse for not being able to automate your build processes using code. MSBuild has a pretty damn simple set of CLI args if you are just doing modern .NET Core 3.x/.NET 5 apps.
git clone <my repo path>
cd <my repo path>
dotnet build --configuration Release
# copy build artifacts to wherever they need to go
That's about it for us.
We use SQLite, so there aren't any dependencies outside of any particular checkout of the repo.
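The same "build it from code" idea, sketched in Python rather than with Process.Start. The `run_steps` helper is hypothetical, and the steps in the demo are stand-ins so the sketch runs anywhere; in a setup like the one above, the list would hold the real commands, such as `dotnet build --configuration Release`.

```python
import subprocess
import sys


def run_steps(steps, cwd=None):
    """Run each command in order and stop at the first one that fails."""
    for command in steps:
        print("+", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Stand-in steps; replace with the real build and copy commands
    run_steps([
        [sys.executable, "-c", "print('build step')"],
        [sys.executable, "-c", "print('copy artifacts step')"],
    ])
```

Because the steps are just a list of commands, the same script can be called from a CI workflow or run by hand on a workstation when the CI is down.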
[0] https://news.ycombinator.com/item?id=26666843