by johnduhart 1489 days ago
> I mean, you didn't even consider implementing a simple fetch of an already cloned repository in your mirroring server code. So yeah, I'd argue that the bad faith part is actually justified.

https://github.com/golang/go/issues/44577#issuecomment-11378...

> We did consider caching clones, but it has security implications and adds complexity, so we decided not to. It is certainly not trivial to do and not something we are likely to do based on this issue.

Drew continues to act as though he is always correct, and any viewpoint that isn't his is just moronic. I've repeatedly seen this behavior from him in multiple venues over the years, and I'm happy to see the wider community start calling this out as childish.


I don't particularly care for Drew, but the issue he's reported here seems totally valid. And if he requested that he be excluded from getting hit by the crawler, wouldn't that mean it would be impossible for people to use packages from sr.ht unless they change their config?

Plus, it does seem reasonable to think that only one of the crawlers needs to hit the site. The global replication can happen at the FS level or, heck, the crawlers can just perform pulls from each other.
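The "only one crawler needs to hit the site" idea can be illustrated with a minimal Go sketch. Everything here is hypothetical (the `Mirror` type, the `fetch` callback, the example repository path); it is not how the Go proxy is implemented, only a picture of several nodes sharing one origin fetch.

```go
package main

import (
	"fmt"
	"sync"
)

// Mirror caches upstream fetches so that many crawler nodes share one result.
// All names are hypothetical; this is not the Go proxy's code.
type Mirror struct {
	mu      sync.Mutex
	cache   map[string]string        // repo path -> fetched revision
	fetches int                      // number of requests that reached the origin
	fetch   func(repo string) string // stand-in for the real network fetch
}

func NewMirror(fetch func(string) string) *Mirror {
	return &Mirror{cache: make(map[string]string), fetch: fetch}
}

// Get returns the cached revision and contacts the origin only on a miss.
// Holding the lock during the fetch keeps the sketch short; it serializes
// all callers, which a production system would want to avoid.
func (m *Mirror) Get(repo string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if rev, ok := m.cache[repo]; ok {
		return rev
	}
	m.fetches++
	rev := m.fetch(repo)
	m.cache[repo] = rev
	return rev
}

func main() {
	m := NewMirror(func(repo string) string { return "rev-of-" + repo })
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // eight crawler nodes asking for the same repo
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Get("git.example.org/~user/module")
		}()
	}
	wg.Wait()
	fmt.Println("origin fetches:", m.fetches) // prints "origin fetches: 1"
}
```

The sketch ignores cache expiry and the security concerns the Go team raised about caching clones; it only shows the traffic-sharing part of the argument.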

No. According to the Go project, adding his site to the exclusion list would reduce traffic to his site at the cost of freshness of the data the proxy collects; it would not make it "impossible" for people to use packages from sr.ht.

This is all in the thread that DeVault linked to from his post.

Which would still be far from great for any kind of source hosting website.
In what way would it be "far from great"?
Well, why have that proxy/functionality in the first place if the best option is to disable it?
I don't even understand the question you're asking. Nobody is suggesting the proxy be disabled, including for sr.ht.

It's OK not to know the specifics of what this is about, but it's weird to have strong opinions about it if you don't.

Drew can often be very abrasive, but does it really matter in this case? His site is basically being DDoS'd.

Yes, there are decent arguments why the golang infra doesn't cache or respect typical norms like robots.txt, but they don't change the unreasonableness of the underlying situation. Surely some mitigation could have been worked out in the year since the ticket was filed?

It appears they offered to turn off refreshing of his domain on Jun 8, 2021: https://github.com/golang/go/issues/44577#issuecomment-85692...
That doesn’t seem like a solution at all and is actually kind of punitive, as that would make sr.ht bad for hosting Go.

I think this is just an example of Google being a jerk and not caring enough to do proper software engineering.

Go seems really interesting but I have avoided using it because it’s so tied to Google. And I don’t trust Google to make good decisions for developers or users.

It looks like a solution to me: Google stops proactive refreshing, and so users get data that is fresh up to the cache timeout.

Users who can't wait that long can disable the proxy, and SourceHut can recommend users do that.

Perfect succinct response. It is a 100% viable workaround.
Can you articulate why it isn't a solution, and how it would be punitive? There are people on this thread who appear to believe Google's workaround would mean that repositories hosted on sr.ht would be unusable as Go modules, which is not at all the case.
Drew articulated very well why Google's offer doesn't help at all.

https://github.com/golang/go/issues/44577#issuecomment-85693...

A full git clone that DDoSes a hoster, just to check whether the user experience is still first-class and to fill a proxy, is not an acceptable solution for a module hoster who has to pay the hosting bills himself.

If they want to know whether their proxy is still up to date, a cheap latest-change request 8x/hour would be appropriate.
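One way to picture such a cheap check, as a rough Go sketch: compare the ref listing a server reports (the `<hash><TAB><refname>` lines that `git ls-remote` prints) against the last listing seen, and clone only when something moved. The helper names `parseRefs` and `changed` and the sample hashes are made up for illustration; nothing here claims to be what the Go proxy does.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseRefs parses `git ls-remote` style output: "<hash>\t<refname>" per line.
func parseRefs(out string) map[string]string {
	refs := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			refs[fields[1]] = fields[0] // refname -> hash
		}
	}
	return refs
}

// changed reports whether any ref was added, removed, or moved.
func changed(cached, remote map[string]string) bool {
	if len(cached) != len(remote) {
		return true
	}
	for name, hash := range remote {
		if cached[name] != hash {
			return true
		}
	}
	return false
}

func main() {
	cached := parseRefs("aaa111\trefs/heads/master\n")
	same := parseRefs("aaa111\trefs/heads/master\n")
	moved := parseRefs("bbb222\trefs/heads/master\n")
	fmt.Println(changed(cached, same))  // false: nothing new, skip the clone
	fmt.Println(changed(cached, moved)) // true: a fetch is warranted
}
```

A ref listing is a few hundred bytes for a small repository, whereas a full clone transfers the whole history, which is the cost difference the comment is pointing at.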

> Have you considered the robots.txt approach, which would simply allow the sysadmin to tune the rate at which you will scrape their service? The best option puts the controls in the hands of the sysadmins you're affecting. This is what the rest of the internet does.

> Also, this probably isn't what you want to hear, but maybe the proxy is a bad idea in the first place. For my part, I use GOPROXY=direct for privacy/cache-breaking reasons, and I have found that many Go projects actually have broken dependencies that are only held up because they're in the Go proxy cache, which is an accident waiting to happen. Privacy concerns, engineering problems like this, and DDoSing hosting providers: this doesn't look like the best rep sheet for GOPROXY in general.

You didn't answer my question. What's the problem with the Go team's workaround? I get that DeVault would like to redesign the Go modules system to suit his own preferences, but that's not on the table.
In this case, it's really hard to see thrashing other people's servers relentlessly to collect data you already have as anything but incredibly, incredibly poor engineering. Y'all should write him a check for that much resource waste.
This comes to mind: https://news.ycombinator.com/item?id=31496063

>At Google we were told to stop thinking about all this stuff, that the storage hardware and software people were responsible for hiding things like wearout from application developers.

Something tells me this team was told to "stop thinking about all this stuff, that the network people were responsible for hiding things like speed, latency and cost from application developers." aka network is infinite, keep pounding that repo and we will scale accordingly (our side of the equation, sucks to be other people)

without knowing anything about this situation outside of this thread and the post it links to, it comes across as willful negligence to screw over someone who was a bother in past community transgressions
That's a risible suggestion. Even DeVault doesn't say that.
It's less than $200 per month to send out 4 GB daily. If his business can't afford that, there is something else going on.

What is the total daily bandwidth that sourcehut uses anyway? What percentage is go module fetching?

The 4 GB daily figure came from a different user, who hosted a Go module on his own server and was its only user. This was not DeVault.

I'd be pretty pissed if I hosted a Go module essentially for myself and suddenly had a $200 bill because Google decided to clone my repository 500 times a day. If it doesn't bother you, how about you donate $200 a month to a charity of my choosing, since it doesn't matter to you?

Self-hosting costs money. For this one user, blocking or other options would seem more tenable.

If money was a problem, I'd expect this individual to have rectified it on their end.

So tell me why do people use DDoS protection? It's just money. If you run a server you should be able to eat all the cost!

Seriously do you follow through what your arguments actually mean if applied in general?

There was no actual DDoS, so there's no need to compare.

Should every language be responsible for paying the bandwidth bills for dependencies?

You might look at the most recent comment from the Go team on the issue. There have been no additional requests or events since they last resolved it for both of the affected parties.

Plenty of bootstrapped businesses have better things to spend $200 / month on, let alone the time spent trying to figure out where all the anomalous traffic is coming from. As I understand it, it's not simple file fetches either. It's cloning a repo, which involves two-way communication, consumes CPU and RAM, and causes disk seeks. You're not slapping it on CloudFront and calling it a day. Finally, it looks to me like the costs are going to scale the more people he has using sourcehut and writing Go modules.

I don't really understand turning this around on him. Why should he have to subsidize Google? If it's not a problem, why do we have robots.txt at all? Just let bots hammer your site and cope with it.

The current situation can't be the optimal solution. It wasn't even present prior to Go 1.16. Only one company has the ability to change that. What should he do differently here? Why should he have to spend any of his time or money working around an issue he didn't create?

That was a different user. The fact that a user not running a git hosting service is potentially eating $200 a month should cue you in to the fact that the cost to Drew is likely drastically higher than that.

Google should be sending reimbursement checks for the damage done here on this issue.

Drew is running a code hosting business and this is a cost of providing a feature to the users. He can pass the costs on if it is a problem. He has lots of options and his competitors are not making a big deal out of this.

I suspect he's drawn his line in the sand and wants to keep it going rather than finding a solution that works without requiring upstream changes.

If I provide a paid service and someone abuses it I must deal with it because my larger competitors deal with it? It's good to know that small businesses have no place in the modern world.
> but it has security implications and adds complexity

Read: we prefer to use your servers for caching. Not good enough. Maybe the issue is people making silly evasive arguments like these while the server load piles on?