by CyberDildonics 2014 days ago
Maybe it would work to put a marker argument (like the IP address as base64) in the URL when there might be snowballing traffic, so you can see if it comes back at you. That could be used to send a page with all the links taken out, or just to rate limit the request.
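A minimal sketch of the marker idea in Python, assuming the marker is just the client IP, base64url-encoded into a query parameter. The parameter name `m` and the helper names `tag_url` and `read_marker` are made up for illustration. Note that base64 is an encoding, not encryption, so anyone who sees the link can read the IP.

```python
import base64
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def tag_url(url: str, client_ip: str, param: str = "m") -> str:
    """Return url with the client IP added as a base64url query parameter."""
    marker = base64.urlsafe_b64encode(client_ip.encode()).decode().rstrip("=")
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[param] = [marker]  # overwrite any marker already present
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def read_marker(url: str, param: str = "m") -> Optional[str]:
    """Return the IP stored in the marker, or None if absent or undecodable."""
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    marker = values[0]
    padded = marker + "=" * (-len(marker) % 4)  # restore stripped padding
    try:
        return base64.urlsafe_b64decode(padded).decode()
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


tagged = tag_url("https://example.com/article?page=2", "203.0.113.7")
recovered = read_marker(tagged)
```

A server could compare `recovered` against the requesting address and decide whether to strip links or throttle; what to do on a mismatch is a policy choice this sketch leaves out.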

Tricks like that don’t work with sites that are receiving a lot of traffic. Also, the exact solution you’ve described is a liability—IP addresses leak when people send each other links, and having unique URLs like that can cause issues with caching. Sure, we could store tokens in a database, but then you’ve just moved the bottleneck to the database.

We do have various ways to combat these issues; like any website of sufficient size, we have pretty complex methods of detecting problematic traffic and assessing the risk of any given request or session. However, no solution is perfect, and with the number of broken crawlers we see, some will inevitably cause problems.

To be clear, we can adjust our code and block them—that’s not an issue. The issue is that I have to wake up at 3 AM to do it, and even if it’s blocked, dealing with that traffic can be expensive. This guy got his $72k bill forgiven, but don’t expect the websites on the other end to be so lucky. (Yes, yes, ingress bandwidth is often free, but it’s never that simple. Scaling up? Bezos takes a cut. More database traffic? Pay the Bezos tax. Replication of enormous logs to other providers? Bezos hungry!)

Negligence is negligence. If you get in a car and drive recklessly without proper training, even if you didn’t intend to hurt anyone, you’re not going to get a lot of sympathy when you mow down a pedestrian. Likewise, I have little sympathy for people who face enormous bills for abusing powerful tools.

That’s not to say cloud providers don’t have billing problems. The delays are unacceptable, and the budgeting tools are often unintuitive or, as was likely the case here, outright inadequate. But in no universe was deploying code that spun up a container for every URL encountered a good idea.

Should such a mistake result in a $72k bill? Eh, probably not. I doubt this person will make the same mistake again, even with the bill forgiven. Or maybe they’ll just blame Google and attempt the same thing on AWS.

I would think you could obscure whatever marker you use fairly easily, any basic encryption should work. It mostly seems like you could do something that temporarily throttles crawlers to a limit that doesn't affect humans much, so you don't have to do something manual in the middle of the night. For example, statistical outliers could get limited to one page request per second per IP, or something like that.
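One way to sketch the "one page request per second per IP" throttle is a token bucket keyed by IP. The token bucket is my choice of technique, not something named in the thread, and `PerKeyThrottle`, `rate`, and `burst` are hypothetical names. The burst allowance is an assumption meant to keep a human opening a few tabs from being limited.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class PerKeyThrottle:
    """Token bucket per key: refills `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate: float = 1.0, burst: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be tested
        self._buckets: Dict[Hashable, Tuple[float, float]] = {}

    def allow(self, key: Hashable) -> bool:
        """Return True if a request from `key` may proceed right now."""
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False


# Usage with a fake clock: two requests fit in the burst, the third is refused.
fake_now = [0.0]
throttle = PerKeyThrottle(rate=1.0, burst=2, clock=lambda: fake_now[0])
results = [throttle.allow("203.0.113.7") for _ in range(3)]
```

The sketch never evicts idle keys, so a real deployment would need some expiry, and the choice of key (IP, cookie, or something else) matters more than the bucket itself.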

The rest of this is arguing against something I'm not saying, which is fine, but thinking about a solution is not condoning the problem.

> I would think you could obscure whatever marker you use fairly easily, any basic encryption should work.

Indeed, you can, and there are situations in which it makes sense. However, it doesn’t really help when it comes to detecting abuse of this sort. For one, CGNAT causes problems. There’s also the issue of people linking to articles from sites like HN and the Wayback Machine. Those two alone make it nearly impossible to automatically rate limit based on an ID in the URL.

CGNAT is a big issue that Western companies tend to neglect. However, it’s increasingly common in places like India, and it’s even seen at times in the US, especially in rural areas.

And, of course, public VPNs are growing in popularity.

Unfortunately, all of these factors mean that performing any sort of risk analysis or rate limiting on IP address alone tends to be ineffective or outright harmful for moderately large sites. You can do some fairly basic categorization (this is from a residential ISP, this is from a datacenter), but beyond that, it’s not particularly useful.

Hypothetically, let’s say:

1. We tag every URL with an IP address association in some way.

2. Someone posts a link on HN.

3. We see lots of requests with IP address tags that don’t match the actual requesting IP address, so we block or rate limit them.

4. We’ve just blocked traffic from HN.
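The failure in those four steps can be simulated in a few lines. This is a toy model, assuming the tag is simply the IP of whoever first loaded the page; the addresses are documentation-range placeholders and `count_flagged` is a made-up helper.

```python
from typing import Iterable


def tag_matches(tag_ip: str, request_ip: str) -> bool:
    """Step 3 of the hypothetical: a request is trusted only if the IPs match."""
    return tag_ip == request_ip


def count_flagged(tag_ip: str, request_ips: Iterable[str]) -> int:
    """Count requests whose IP differs from the IP baked into the URL."""
    return sum(1 for ip in request_ips if not tag_matches(tag_ip, ip))


# One person copies a tagged URL to HN; 100 ordinary readers click it.
poster_ip = "203.0.113.7"
readers = [f"198.51.100.{i}" for i in range(1, 101)]
flagged = count_flagged(poster_ip, readers)
print(f"{flagged} of {len(readers)} legitimate readers flagged")
```

Every reader carries the poster's tag, so a mismatch rule cannot tell a shared link from a crawler replaying URLs.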

Another hypothetical:

1. We design, calibrate, and test a rate limiting system in the US.

2. Some large percentage of real-world traffic comes from India and is behind CGNAT.

3. We’ve just rate-limited most of India.

4. So we exclude India.

5. But now we’ve rate-limited Nigeria, and malicious traffic from India isn’t blocked.
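The same sequence as rough arithmetic in Python. All numbers here are invented for illustration (0.1 requests per second per person, 200 people behind one CGNAT address, a limit of 1 request per second per IP), and `decide` is a hypothetical helper, not anyone's production logic.

```python
from typing import Set


def decide(country: str, users_behind_ip: int, req_per_user: float,
           limit_per_ip: float, exempt: Set[str]) -> str:
    """Per-IP limit with a per-country exemption list, as in steps 1 to 5."""
    if country in exempt:
        return "allow"  # exempt countries bypass the limit entirely
    aggregate = users_behind_ip * req_per_user  # what the server sees per IP
    return "limit" if aggregate > limit_per_ip else "allow"


LIMIT = 1.0      # requests/second per IP, calibrated on one user per IP
PER_USER = 0.1   # requests/second from one ordinary reader

us_home = decide("US", 1, PER_USER, LIMIT, exempt=set())
india_cgnat = decide("IN", 200, PER_USER, LIMIT, exempt=set())

# Step 4: exempt India. Step 5: Nigeria is still limited, and a single
# abusive crawler in India (50 requests/second) now passes untouched.
nigeria_cgnat = decide("NG", 200, PER_USER, LIMIT, exempt={"IN"})
india_crawler = decide("IN", 1, 50.0, LIMIT, exempt={"IN"})
```

Under these assumptions the limit punishes shared addresses and the exemption waves through exactly the traffic it was meant to catch.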

What we actually end up doing is similar but relies mostly on cookies instead, and it’s only a single risk factor. It’s not perfect, and it has some caveats that the URL solution avoids, but it has far fewer false positives.