| HN Mirror



	by a10c 21 days ago
	My action failed with "Unexpected error fetching GitHub release for tag refs/heads/master: HttpError: Sorry. Your account was suspended" Which certainly made me shit myself, briefly.

5 comments

neya 20 days ago

It's an eye opener. Think about it - today, it was a mistake. But, what if it really happened? What if you really lost access to all your years of hard work? It's a wake up call. A blessing in disguise to store what matters to you the most locally, backed up offline. Never trust any single provider. Be it MS or Google or Apple. RAID is the way.

onion2k 20 days ago

People should use something that keeps a local copy of their code and just copies it to Github and to other contributors with a sync process to push and pull changes. Some sort of 'distributed source control system' maybe. Then people would only need a 'hub' to connect to people, and it'd be easier to move somewhere else.

gopalv 20 days ago

> Some sort of 'distributed source control system' maybe

The day it broke away and became centralized was when we had a PR + mandatory "Required actions" to merge to main.

ruszki 20 days ago

That’s only mandatory on the “hub”. I can do that locally anytime.

bergie 20 days ago

I'm looking at setting up rngit mirrors of all my repos on our boat NAS. Conceivably it also allows issue tracking and collaboration without centralized infra

https://reticulum.network/manual/git.html#mirroring-reposito...

fusishch 20 days ago

What you just described is Fossil. It has an auto-sync feature that makes everything feel distributed.

Just set up a Kubernetes deployment and you’re set.

But as others mention, GitHub’s primary strength is collaboration. If you want decentralized, solve this by creating a decentralized collaboration tool on top of fossil and/or git.

For example, how to do pull requests and code reviews?

40four 20 days ago

What they just described is Git :) pretty sure it was a joke

marricks 20 days ago

I like how tech seems to be all about stacking more and more turtles on top of each other:

Gosh, it's hard figuring out what changes Lorne made if only we had a system to merge those changes. Enter git

Gosh it's hard figuring out what packages Rachel had to make this work. Enter rubygems/pip/npm

Gosh it's hard figuring out how to sync these changes across a network. Enter github

Gosh it's hard figuring out how to get those packages working on my operating system. Enter docker

Gosh centralizing our distributed version control software system onto one website is getting really unreliable. Enter fossil(?????)

If we go any further having one computer per business with a sign-up sheet is starting to sound pretty fucking attractive.

coldpie 20 days ago

This gets tiresome. Github is a lot more than a host for Git repositories. If you want to suggest that people use something else, you need to suggest a replacement that has the features people use Github for.

ornornor 20 days ago

Increasingly less and less so as they “upgrade” their offering and have more and more downtime.

doctorpangloss 20 days ago

yeah, #1, it is free private file storage, and #2, it's a download portal for free as in beer software replacing paid offerings. that's what it is for 99.99% of people.

being a host for git repositories has never been its core competency. neither has its groupware offering.

does it even serve OSS well? a very interesting criterion is, "Have mature or adopted end-user-facing OSS recently merged a large PR from an unallied contributor?" The answer is overwhelmingly no. This is why there is so much innovation in this space.

danudey 20 days ago

I think you missed the joke, which is that the parent poster you're replying to is suggesting a 'solution' to the problem which evolved in complexity until he was just describing Github again.

mpaco 20 days ago

I recently got my GitHub account suspended for 4 months. When it was finally reinstated, their support just said it was a "mistake".

Proudly self-hosting Forgejo since then.

MatthiasPortzel 20 days ago

This happened to me as well—thankfully not my personal account that I use for work, but the organization associated with an open source project I worked on was suspended. It similarly took 2 months for GitHub to restore the organization.

> Our team is currently experiencing an unexpectedly high volume of tickets which has resulted in longer response times than we prefer. We acknowledge the long wait and apologize for the experience.

> Sometimes our abuse detecting systems highlight accounts that need to be manually reviewed. We've cleared the restrictions from your account…

Fully self-hosted IMO can be an overcorrection. The issue isn’t “relying on other people”—it’s relying on GitHub, when they’ve made it clear they don’t care about uptime and they don’t care about support turn-around-time.

SpaceNoodled 20 days ago

I care about uptime and have instant support turnaround. Self-hosting sounds like a great solution.

iso1631 20 days ago

Well yes, my git repositories sit on my laptop, that's the entire point. If github banned my country because its president has a tizz, I can push my entire commit history to another company. Same with anyone else who's working on it.

It would be a pain as I'd have to set up a few integrations again, but github is far lower down the risk scale than the vast majority of SAAS providers

bulbar 20 days ago

They rely on GitHub actions, not the repository itself.

I hope people here are aware that you can push your repo somewhere else if wanted.

Git is a distributed system, there isn't even a server, only other git repo instances that are remote.
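
A minimal sketch of what "push your repo somewhere else" can look like, assuming two hypothetical backup remotes (the URLs are placeholders, not real hosts). It builds one `git push --mirror` command per remote and only runs them when `dry_run` is turned off:

```python
import subprocess
from typing import Dict, List

# Hypothetical backup remotes; replace with hosts you actually control.
BACKUP_REMOTES: Dict[str, str] = {
    "gitlab": "git@gitlab.example.com:me/project.git",
    "nas": "ssh://nas.local/srv/git/project.git",
}


def mirror_commands(repo_dir: str, remotes: Dict[str, str]) -> List[List[str]]:
    """Build one `git push --mirror` command per remote URL."""
    return [
        ["git", "-C", repo_dir, "push", "--mirror", url]
        for _, url in sorted(remotes.items())
    ]


def push_all(repo_dir: str, remotes: Dict[str, str], dry_run: bool = True) -> List[List[str]]:
    """Print the commands, and only execute them when dry_run is False."""
    commands = mirror_commands(repo_dir, remotes)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Note that `--mirror` pushes every ref and deletes remote refs that no longer exist locally, so it is best pointed at dedicated backup remotes rather than a shared repository.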

iso1631 20 days ago

I rely on actions, but those actions are pretty much "on this type of change to this branch run these scripts"

It will be a hassle to migrate to another platform, possibly a couple of hours work to do the 25 repos in my ~/git/ directory.

Even highly complicated actions can be migrated quite easily -- the source is stored in .github/workflows/blah.yml
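
As a rough first step for a migration like this, the workflow files can be inventoried per repository. The sketch below assumes each repo is a direct child of one root directory (such as the `~/git/` mentioned above) and only lists files; it does not parse or convert them. The function name is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def find_workflows(root: Path) -> Dict[str, List[str]]:
    """Map each repo directory under root to its workflow file names."""
    found: Dict[str, List[str]] = {}
    if not root.is_dir():
        return found
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        workflow_dir = repo / ".github" / "workflows"
        if not workflow_dir.is_dir():
            continue
        # Only YAML files count as workflow definitions.
        files = sorted(
            f.name for f in workflow_dir.iterdir()
            if f.suffix in (".yml", ".yaml")
        )
        if files:
            found[repo.name] = files
    return found
```

Calling `find_workflows(Path.home() / "git")` returns a dictionary such as `{"myrepo": ["blah.yml"]}`, which gives a quick count of how many workflows would need porting.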

noselasd 18 days ago

I've set up a local gitea now, and configured a few local runners as we test this setup out.

It's a few hours' worth of work. Basic git operations and pull requests work fine for us already.

The interesting part will be how much maintenance this will need, and not least how hard it'll be to port over github actions. We have trivial workflows, but I suspect this conversion will be the painful part.

corvad 20 days ago

RAID is not a backup.

PokemonNoGo 20 days ago

They... Didn't describe RAID? More 3-2-1.

filleduchaos 20 days ago

The last sentence in the comment is literally "RAID is the way".

jrockway 20 days ago

I think they were intending to evoke the image of RAID rather than literally referring to a redundant array of inexpensive disks. You host your code on Github, Gitlab, and at home, then you survive a Github outage. It's a redundant array. Not sure it's inexpensive, though.

grim_io 21 days ago

A brownout redefined.

DonHopkins 20 days ago

ShitHub

https://www.youtube.com/watch?v=LGeOee7x5lY

lachieh 20 days ago

Good thing I'm wearing my brown pants today.

drcongo 21 days ago

Same. It's weird how I always find out that GitHub is down before GitHub does. Took 15 minutes before it appeared on githubstatus.com

jaapz 21 days ago

All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.
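
A rule of that shape might be sketched as follows, assuming per-minute 5xx counts are already being collected (the function and parameter names are made up for illustration):

```python
from typing import Sequence


def should_alert(
    errors_per_minute: Sequence[int],
    baseline: int,
    sustain_minutes: int,
) -> bool:
    """True only if each of the last `sustain_minutes` samples exceeds baseline."""
    if sustain_minutes <= 0 or len(errors_per_minute) < sustain_minutes:
        return False
    recent = errors_per_minute[-sustain_minutes:]
    return all(count > baseline for count in recent)
```

With `baseline=5` and `sustain_minutes=3`, a single spike like `[2, 3, 40, 2, 3]` stays quiet, while `[2, 40, 50, 60]` fires. The rule cannot fire until the full window has elapsed, which is exactly the delay users notice.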

logifail 20 days ago

> All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.

Is it true that official service status pages are updated automatically?

baby_souffle 20 days ago

> Is it true that official service status pages are updated automatically?

Depends. Typically no because there’s an art to crafting the actual message around impact… but sometimes yes it is automated

logifail 19 days ago

> Typically no because there’s an art to crafting the actual message around impact

I was thinking more of needing to notify/get sign-off from management...

baby_souffle 19 days ago

> I was thinking more of needing to notify/get sign-off from management...

Yeah, that's usually part of it. Precise language matters a TON when you might have some expensive breach-of-SLA terms.

Sometimes the people first responding don't even have the full picture yet and can't fully articulate the impact so they leave it vague.

hnlmorg 20 days ago

You'd expect them to be monitoring more than just the HTTP response codes from user requests for precisely this reason.

If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.

But effective monitoring is harder than people assume.

dncornholio 20 days ago

> If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.

Isn't that what monitoring actually is? The issue seems to be in their testing, not monitoring.

hnlmorg 20 days ago

No, monitoring for HTTP response code is a subset of observability and not one that generally gives you the best insights into which subsystems are misbehaving nor why.

There are synthetic tests, where you can generate API request calls or even simulate an entire user journey. These allow you to control the user agent and the payloads, and thus you know any errors that come back are actual errors. These are triggered by the observability platform (think like running a cron-job) and thus you're not tied to user activity to see when problems arise.
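
A minimal sketch of such a synthetic probe, using only the Python standard library. The user agent string, the 2xx health rule, and the three-failure threshold are all assumptions for illustration; a scheduler (cron or similar) would call `probe` at a fixed interval.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the request could not complete."""
    request = urllib.request.Request(url, headers={"User-Agent": "synthetic-probe/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with a 4xx/5xx
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


def probe(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """One synthetic check: healthy means a 2xx status."""
    return 200 <= fetch(url) < 300


class ProbeAlarm:
    """Fire only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        # A healthy probe resets the streak; a failed one extends it.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold
```

Passing `fetch` as a parameter keeps the probe testable without network access: a stub that returns a fixed status code can stand in for the real request.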

There are other metrics outside of HTTP response codes too. Think like free RAM, CPU usage, disk space, etc. This is just naming some obvious ones because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally have alerts when an irregular trend is showing that things are likely to fail too. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
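
One simple way to alert on a trend rather than a failure is to fit a line to recent samples and extrapolate. The sketch below does this for disk usage, assuming `(hour, percent_used)` samples and a 24-hour warning horizon; both choices, and the function names, are illustrative rather than a recommendation.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]],
    capacity: float = 100.0,
) -> Optional[float]:
    """Estimate hours until usage hits capacity from (hour, percent_used) samples.

    Fits a least-squares line; returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / variance
    if slope <= 0:
        return None
    last_t, _ = samples[-1]
    fitted_last = mean_u + slope * (last_t - mean_t)
    return max(0.0, (capacity - fitted_last) / slope)


def trend_alert(
    samples: Sequence[Tuple[float, float]],
    warn_within_hours: float = 24.0,
) -> bool:
    """Warn when the fitted trend reaches capacity within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= warn_within_hours
```

For samples growing two points per hour from 70% to 76%, the fitted line reaches 100% in 12 hours, so the alert fires well before the disk is actually full.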

Then you have more traditional stuff like logs. This will also be bespoke to the application. But you'd expect errors in logs to get surfaced quickly. Assuming Github have good hygiene in what's being logged.

Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.

(this is just a super high level view of observability too)

lokar 20 days ago

Even a synthetic probe needs a few failures to trigger an alert.

You should not alert on cpu, ram, etc

re-thc 20 days ago

> But effective monitoring is harder than people assume.

Who says public status page equals internal monitoring.

They likely know faster than you. Whether they post it publicly is a different issue (hint: SLA penalties, news impacting stock etc)

hnlmorg 20 days ago

I never mentioned anything about status pages.

Are you sure you’re replying to the right comment?

re-thc 20 days ago

> I never mentioned anything about status pages.

For context, the parent comment you replied to started with status page.

Then are you talking about internal leaks or just guessing? Otherwise besides what's public how do you know they don't know?

echelon 21 days ago

In a high performance service with good maintenance and upkeep, you page for all 500s. A noisy pager forces the team to fix the 500s.

Maybe the Github Actions infrastructure isn't run like that.

edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262

Doohickey-d 21 days ago

I'm curious about this: because in my experience (working on smaller services though), a small number of errors is always there, as a "baseline".

Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"

Which makes me think a small amount of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.

KPGv2 20 days ago

Bitflips are something that can happen in consumer-grade RAM, so that tracks (and it's comforting that wayward cosmic rays are a substantial reason for an application's crashes!), but on enterprise servers, they will run ECC RAM that is very resistant to bit flips.

This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.

I wouldn't expect bit flips to be a significant contributor to enterprise problems.

Anon1096 20 days ago

Bitflips specifically may not be; things like network issues, noisy neighbors, row/rack/host maintenance (leading to a downed and migrated host) absolutely are things that happen at high frequency at scale and cause your background level of errors to be more than 0.

maccard 20 days ago

You've completely missed the point - It's not about bitflips it's about errors that are outside the scope of what's fixable.

bobthepanda 20 days ago

It’s where monitoring for 9s is more important at that scale than absolute errors. So long as degradation is graceful or retried it should not be a massive problem.
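
To make "monitoring for 9s" concrete, here is a small sketch that turns raw request counts into an availability figure, a count of nines, and the share of an error budget still unspent. The SLO value is an assumed input, not something taken from any particular service.

```python
import math


def availability(total: int, failed: int) -> float:
    """Fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - failed) / total


def nines(avail: float) -> float:
    """Express availability as a count of nines, e.g. 0.999 is about 3."""
    if avail >= 1.0:
        return math.inf
    return -math.log10(1.0 - avail)


def budget_remaining(total: int, failed: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = (1.0 - slo) * total
    if allowed <= 0:
        raise ValueError("slo must be below 1.0 and total must be positive")
    return 1.0 - failed / allowed
```

With a 99.9% SLO, 500 failures out of 1,000,000 requests leaves about half the budget, so a handful of retried errors does not need to page anyone.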

It does require constant tuning and adjustment though.

TheDong 21 days ago

Do you know of a single service at a single company that actually does that?

I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.

I know none of those are particularly "high performance" though. Curious where your experience is coming from.

echelon 21 days ago

I worked at a large fintech moving billions of dollars in volume a day.

I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.

We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.

Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.

Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.

It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
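
For a sense of scale, the downtime that a given number of nines allows can be computed directly. This is plain arithmetic over a 365-day year, not a description of how any specific SLA is measured:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed per period for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines
    return period_minutes * unavailability
```

Five nines works out to roughly 5.3 minutes per year, while three nines allows about 8.8 hours.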

We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.

sunrunner 20 days ago

> We paged for every single 500.

Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?

CBLT 21 days ago

I've been oncall for a different G service that nearly paged on every error. It used the standard error budget tooling, but on hundreds of user buckets because the engineering around locality-specific configuration was... suspect. Many of these buckets had single-digit user counts. If a user was on their phone and lost signal, I was paged. Very poor oncall experience.

theta_d 20 days ago

The sub-service at IBM cloud I worked on had an insanely small error budget such that pages were nearly constant. On call was hell week until a few of us insisted on fixing the issues. The "few" of us were contractors. The employees seemed more than willing to just let the pages continue.

alexfoo 20 days ago

Some companies pay more if people are paged. It can create a perverse incentive not to fix problems or, in extreme cases, to watch things going wrong, waiting for the page, and then being ready to fix it straight away.

compumike 20 days ago

Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:

If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!

If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.

Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
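
The two extremes above can be sketched as a routing function. Everything here is hypothetical: the `critical` flag, the 9-to-5 weekday definition of business hours, and the assumption that checks run at a fixed interval, so that N consecutive failures already implies the M-minute window.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass
class CheckHistory:
    name: str
    critical: bool             # e.g. a DB health check vs. a form edge case
    recent_ok: Sequence[bool]  # oldest first, one entry per check run


def consecutive_failures(recent_ok: Sequence[bool]) -> int:
    """Count failed checks at the end of the history."""
    count = 0
    for ok in reversed(recent_ok):
        if ok:
            break
        count += 1
    return count


def route(history: CheckHistory, now: datetime, n_consecutive: int = 3) -> str:
    """Return 'page', 'notify', 'queue', or 'ok'."""
    failures = consecutive_failures(history.recent_ok)
    if failures == 0:
        return "ok"
    if history.critical and failures >= n_consecutive:
        return "page"  # wake someone up, whatever the hour
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    # Non-urgent failures wait for the next working day.
    return "notify" if business_hours else "queue"
```

A critical check that has failed three times in a row pages even at 3am on a Saturday, while a one-off failure on a non-critical check is queued until working hours.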

wasmitnetzen 20 days ago

Shouldn't Github be large enough to not have anyone on-call, but just rotate the responsible team around the world?

alexfoo 20 days ago

One team can't troubleshoot AND FIX every possible subsystem, so you just end up with lots (growing to hundreds) of people "on-call" anyway.

As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).

Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.

Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.

bobthepanda 20 days ago

At least when I worked at a Bigcorp a lot of that was being cut to save costs.

awithrow 21 days ago

that is absolutely not the case for any system of size and scale. that would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.

hnlmorg 20 days ago

It depends what you're monitoring. If it's response codes from user generated queries, then I'd agree with you.

But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.

hvb2 20 days ago

> A noisy pager forces the team to fix the 500s.

I'm sure you're not in ops. Or in a dev org of a service with decent request rates.

What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have 0 500s. Not when it still sees development.

A 50 year old bank API? Maybe...

rhyperior 20 days ago

You only do this when you’re trying to use incident management as a hammer to make a point to somebody whom you have otherwise failed to convince to fix something through persuasive argument. Ie, it’s punitive.

swiftcoder 20 days ago

Yeah, no, nobody runs cloud services like that. At AWS most alarms required failures in 3 consecutive 5 minute periods. Critical things could be on 3 consecutive 1 minute windows - but that alarm starts a 15 minute escalation for the oncall engineer to check in, and they have to validate the issue isn't a false alarm before updating the status page would even be considered

jordemort 21 days ago

forget it, Jake; it’s Azure

registeredcorn 20 days ago

I'm not arguing with what you're saying, but it does make me wonder: What exactly is the point of the status page, if "it is normal for users to already see errors before GitHub officially counts it as an outage"?

Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at, and can feel like they are "doing something"? Or is it more of a way to confirm what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"

filleduchaos 20 days ago

There is oddly enough a middle ground between "zero errors whatsoever" and "outage".

simonjgreen 21 days ago

More likely that 'update the Status site' lives a long way down their incident response plan, and they have alarms going off well before that

jordemort 21 days ago

yeah I mean a company the size of GitHub certainly can’t be expected to have enough staff to walk and chew gum at the same time

swiftcoder 20 days ago

If it's like other BigTechs I have worked at, you need director-level signoff and comms team approval to post an outage notice

PunchyHamster 20 days ago

it should be automatic tho. Probably isn't so they can at least get the one nine on availability

simonjgreen 20 days ago

Marketing definitely takes interest in status sites

re-thc 21 days ago

> It's weird how I always find out that GitHub is down before GitHub does

No, it's not. Official updates = potential SLA penalties. Always requires approval.

drcongo 20 days ago

This is the most plausible reply.

chrisjj 20 days ago

> githubstatus.com

There's a threshold. It shows only once 1000 users complain.

/i

ridiculous_leke 20 days ago

> Which certainly made me shit myself, briefly.

Can you sue companies for inducing such anxiety?

Imustaskforhelp 20 days ago

IANAL, but I can probably imagine a case being made if a person really got so stressed that, for example, a health condition was triggered by the stress. It might be up to the lawyer to explain how exactly the service caused the stress and its direct relation to the health condition, though, and up to the judge.

but I suppose that there might be some terms and conditions within using github (ahem Microsoft) that mean you can probably not sue them for something like this.

It really depends upon the severity of situation (imo)

For example, if a person had any heart condition and they got so stressed because of an error at github (which to be fair, I can understand the stress part, imagine losing some part of your software because it was on github and the amount of direct damage to livelihood if your income depended on it)

and I think that the judge might have to be in just the right technical know-spot as well and someone who can understand the situation from programmer's perspective hopefully.

Then I can see a case being made.

once again not a lawyer but an interesting question, would love reading other replies to your comment.

also for what it's worth, you can sue any company for X, Y or Z. The question worth asking is if you can win such a lawsuit.

Personally I believe it might be hard but not impossible but for all practical use cases it might as well be but the only answer can probably be found in court. I am just guessing at this point.

dvduval 20 days ago

Yes, Thais can be really frustrating when you’re trying to get work done. There needs to be more competition and better alternatives and the LLMs need to offer easier connection to these alternatives.

weird-eye-issue 20 days ago

What do the Thai people have to do with this? :(

denisw 20 days ago

Pretty sure that they wanted to write "this", typed something different by accident, and auto-correct struck.

weird-eye-issue 20 days ago

Oh gee thanks

superxpro12 20 days ago

Reminded me of the "Thai Fighter" joke from family guy's star wars spoof lol