by the_duke 1663 days ago
At least now I know why my nix builds are failing...

Odd for something like this to slip through and not be rolled back immediately.

Unless it was intentional, in which case it would be even more odd to not communicate this widely beforehand.


Yep. Bugs happen (although having too many of them can fairly be judged).

But I don't understand how github didn't already know about the regression, from automated testing or error monitoring. I have high expectations for github, because they have met them.

A sibling comment (from an engineer at GH) has given some insight: https://github.com/github/feedback/discussions/8149#discussi...

They have error monitoring, but download URLs are tricky to monitor for "errors" correctly: if two URLs 404, how do you know from your Grafana dashboard or whatever which one is a valid miss and which one isn't? Having run a server whose only purpose is to serve static files myself, I can say you get a lot of 404s, all the time, and it's no indication that anything is wrong. This case would only be picked up by a dashboard if it fatally caused an error somewhere (like a 500), but by definition this was never a 500, it was a 404.

As you note, it's just a bug. Sometimes things you don't understand might actually surprise you, it turns out.

That makes sense.

This was reported like it affected all download URLs based on a git tag, which also means all download URLs appearing on github "release" pages.

If so, I'd have expected there would have been some testing that would have caught this too. Of course, sure, bugs in tests happen too.

Obviously bugs happen because bugs happen. I stand by being more disturbed that it took two days to notice and revert the regression than I am by the fact that a regression happened. Users noticed right away and tried to report the problem, but it took two days after that for Github to comment on it and to revert, which seems problematic, no? Especially if the bug was really affecting as many download URLs as I think was reported; if it was only affecting a minority of edge-case ones, that's more understandable.

It's never possible to eliminate regressions. (It may be possible to reduce the rate of them of course). But whether by testing or by receiving error reports from users, it ought to be possible to notice all major regressions in less than two days.

If this affected all links, couldn’t you just monitor it with a general “do downloads of these URLs work” check? You don’t have to (and shouldn’t) monitor 404 returns to test whether downloads are working; the naive and obvious test is to actually attempt to download something.
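A rough sketch of that kind of check, in Python. The function names and the `fetch` hook are made up for illustration, and the URL list would be whatever known-good archive links you care about; none of this reflects how GitHub actually monitors its endpoints.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET on url, or None on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return None


def failing_downloads(urls, fetch=http_status):
    """Probe known-good URLs and return {url: status} for those not answering 200."""
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

Run on a schedule against a handful of tag-based archive URLs that are known to exist, a non-empty result is a direct signal that downloads are broken, independent of how noisy the 404 baseline is.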
Naively, I imagine the rate of 404s or the percentage of 404s as compared to all responses could be helpful in such a service.
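For what it's worth, a minimal sketch of that ratio idea in Python. The function names, the `factor` threshold, and the numbers in the usage note are all hypothetical, and a real alerting setup would need time windows and smoothing on top.

```python
def not_found_ratio(status_counts):
    """Share of responses that were 404, given a {status_code: count} mapping."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return status_counts.get(404, 0) / total


def is_spike(current, baseline, factor=2.0):
    """True if the current 404 ratio exceeds the baseline ratio by `factor`."""
    return not_found_ratio(current) > factor * not_found_ratio(baseline)
```

With a quiet baseline such as `{200: 990, 404: 10}`, a window of `{200: 900, 404: 100}` trips the check. With a noisy baseline such as `{200: 600, 404: 400}`, a window of `{200: 500, 404: 500}` does not, which is the weakness of a purely ratio-based alert on an endpoint that already serves a lot of 404s.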
Reproducible builds via IPFS when?
And then your IPFS pin service has a down time.
The point of IPFS is that as long as a single online host has the content you will be able to fetch it.
Sure, but with peer-to-peer downloads it's often the case that less common stuff has zero seeders.
If Github doesn't at least monitor their 404 error rate for large-scale spikes, whoever is in charge of SRE should be fired.

With no announcements and no response to a now two day old bug report, I see two possibilities:

1) Their monitoring of their infrastructure and of reported issues is shockingly incompetent for a company of their size and importance (the fact that it is a US holiday is irrelevant).

2) This was 100% intentional and they're purposefully looking "incompetent" to get people to shift to using other services for downloads.

My money is on the latter, given others in this discussion are reporting random download link failures starting a month or two ago. A huge number of projects seem to use GitHub as a sort of free file hosting service. I imagine the opex for both storage and bandwidth is a not insignificant amount of money and someone has been told to shoo the freeloaders off the grass.

Announcing they're ending free file hosting for unpaid projects would generate a lot of noise and bad PR. Instead they just make it unreliable, and people go elsewhere. Multiple people in this discussion have described moving downloads off Github in response, which is likely exactly what Github wants.

You might want to not pick up your pitchfork so quickly.

> A change in the handling of URL schemes was deployed a couple of days ago that caused the regression being discussed here. Due to the amount of traffic that the archive endpoints see, and the high baseline of 404s on them, this regression did not cause an unusual increase of errors that would've caused our alerting to kick in. The change has just been rolled back, so the issue is fixed. We will investigate this issue further after the weekend and take the appropriate steps to make sure similar regressions don't happen in the future.

From the discussion thread on GitHub[0]:

A change in the handling of URL schemes was deployed a couple of days ago that caused the regression being discussed here. Due to the amount of traffic that the archive endpoints see, and the high baseline of 404s on them, this regression did not cause an unusual increase of errors that would've caused our alerting to kick in. The change has just been rolled back, so the issue is fixed. We will investigate this issue further after the weekend and take the appropriate steps to make sure similar regressions don't happen in the future.

[0] https://github.com/github/feedback/discussions/8149#discussi...

> If Github doesn't at least monitor their 404 error rate for large-scale spikes

"Is that a service we can charge for?!"