by zenexer 2014 days ago
> The ex-Googler reflected that he missed the possibility of pages that link back to each other, causing "infinite recursion."

Although tangential to the billing issue, this is reckless. If you’re building a crawler of any kind, please, please, please prioritize ensuring this doesn’t happen so I don’t have to wake up at 3 AM.

I run the infrastructure for a moderate-sized site with probably about a hundred million pages or so. We can handle the HN hug-of-death just fine. But poorly-made crawlers that recurse like this? They’re increasingly problematic.

If your solution to fixing your crawler is “throw more concurrency at it and ignore the recursion,” and suddenly your requests start timing out, that’s a pretty damn strong hint that you’re ruining someone’s day.

From my perspective, this will look like an attack. I’ll see thousands of IP addresses repeatedly requesting the same pages, usually with generic user agent headers. Which ones are actual attacks, and which are just poorly-made crawlers? Well, if you’ve got a generic user agent string that doesn’t link to a contact page, and you’re circumventing rate limiting by changing your IP address, and you had the bright idea to let your test code run overnight, I’m going to treat it as an attack. At 3 AM, I’m not inclined to differentiate between negligence and malice.

This is happening more and more often, and I partially blame it on the ease of “accidentally” obtaining a ridiculous quantity of cloud resources. People deploy shoddy test code and go to bed. They turn it off in the morning when they see the bill.

It’s become so prevalent that our company has come up with an internal term for these crawlers that spin up a new thread/container for every page: snowballing crawlers.

Save a sysadmin: don’t snowball.

Oh, and include a useful user agent header so we can contact you instead of your cloud provider.
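As a rough illustration of what a "useful" user agent could look like, here is a minimal sketch. The crawler name, info URL, and email address are hypothetical placeholders, and the `Name/version (+url; email)` layout is just one common convention, not a requirement of any site.

```python
def crawler_headers(name, version, info_url, contact_email):
    """Build request headers whose User-Agent identifies the crawler
    and tells a sysadmin how to reach its operator.

    All four arguments are illustrative; substitute your own details.
    """
    user_agent = f"{name}/{version} (+{info_url}; {contact_email})"
    return {"User-Agent": user_agent}


# Hypothetical values, for illustration only.
headers = crawler_headers(
    "ExampleCrawler", "1.0", "https://example.com/bot", "bot@example.com"
)
```

The resulting dict can be passed to whatever HTTP client the crawler already uses.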


Also - as someone with a ton of experience on the other side of this coin:

Puppeteer etc. are nice and all but if you can get away with raw HTTP requests grabbing and parsing the HTML without pulling down stylesheets, JS, etc. do it. It is WAY more efficient than requesting the full overhead for the user experience from these folks and threading out 5-10 workers to gracefully crawl a site this way doesn't typically cause things to melt down on your target's end.

You may be saying "well I need a browser-stack or evaluated JS to do my work" and you may be right... but honestly though 90% of this stuff is reverse-engineer-able with Charles Proxy and some basic webdev experience. Heck - I've even sandboxed JS from a target's site to generate tokens/etc to cut down on repeat requests. Even CAPTCHA stuff can easily be done without having to pull down full UIX overhead these days.

---

"Save a sysadmin: don’t snowball."

Implement thread limits, rate limiting, throttling, intelligent caching, and try to fit within your target's hosting capabilities without being disrespectful. Often I will "smear" large jobs over weeks worth of time so that it's only a trickle of traffic here and there (and to also fly under the radar... sorry).
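For anyone wondering what the simplest form of throttling looks like, here is a minimal sketch of a per-host throttle that enforces a minimum gap between requests. The `Throttle` class and its parameters are made up for illustration; a real crawler would likely combine this with a worker limit and caching.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one host."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval_s has passed since the last request.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval_s - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Call `wait()` right before each request; "smearing" a job over a longer period is then mostly a matter of choosing a larger `min_interval_s`.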

Also - on the custom UAS: Unless you're trying to make it easy to get blocked/identified then don't take this advice. Let's face it - this is a gray area for most. The best way is to not "snowball" and to make your scrapers indistinguishable from a reasonable stream of real users from real networks. I would never expect a sysadmin to contact me because frankly they aren't paid to.

---

One last thought - the people who are out there writing these bots/crawlers/etc. are often the lowest common denominator. They're the type that will get something "working" and hurry onto the next job because the nature of the work tends to be a ton of low-paid contract stuff. Also, at almost every place I've worked at in ecommerce that has scraping involved it's the bottom-rung dev talent that's assigned to the work.

Sucks, but near-100% I attribute your "snowball" situation to that.

> Also - on the custom UAS: Unless you're trying to make it easy to get blocked/identified then don't take this advice.

I can’t speak for other sites, but we’re pretty good at picking up on crawlers that don’t have a unique UA. The problem is that we’re going to have a hard time differentiating your well-behaved crawler from more malicious crawlers, and you’re going to get caught in the crossfire.

> if you can get away with raw HTTP requests grabbing and parsing the HTML without pulling down stylesheets, JS, etc. do it.

If you combine that with the lack of an identifying UA, there’s unfortunately a good chance you’ll get caught in the crossfire during an actual attack. That being said, it’s good advice otherwise. If you’re trying not to be identified as a crawler, it’s really going to stand out, though.

> I would never expect a sysadmin to contact me because frankly they aren't paid to.

I am. Furthermore, as long as you’re being transparent about your activity (see: UA), I don’t mind working with you instead of your provider. I understand that writing good crawlers is a learning experience; mistakes do happen. When I send abuse reports, usually people just get a slap on the wrist, but not everyone is that lucky.

But, if your UA has contact info, I can:

1. Easily rate limit or block you until the issue is resolved

2. Contact you directly, explaining exactly what’s wrong

3. Easily unblock you once it’s fixed

Sure, I’m not going to be happy about it, but I’m going to be a lot happier than if you try to blend in—a situation in which I’m not going to have any sympathy.

Unfortunately, most sites don’t respond that way and would rather just block anything remotely suspicious. But since you can always change your IP address, maybe try with an identifiable UA first—please? :)

Edit: Also, a few recommendations to add:

1. Be prepared to handle obscure HTTP status codes. 503 indicates you need to back off. Frequent 500, 502, or 504 means the same thing. 429 and 420 mean you’re being rate limited; slow down. 410 means you should stop requesting the given URL. 400 or 405 means you probably have a bug. Any unrecognized 4XX or 5XX error should be flagged and examined so you can handle it better in the future.

2. You can send an X-Abuse-Info header and a generic UA if you want capable sysadmins to be able to identify you but want to avoid being blocked by inexperienced webmasters.

3. Don’t ignore abuse reports.

4. Try to be consistent and ramp up slowly. It’s harder to cope with unnaturally-abrupt increases in traffic.
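The status-code guidance in (1) could be sketched roughly like this. The function name and the action labels are invented for illustration; how a crawler actually reacts to each label (how long to back off, when to give up) is left open.

```python
def classify_status(status):
    """Map an HTTP status code to a suggested crawler reaction.

    The labels are illustrative, not a standard vocabulary.
    """
    if 200 <= status < 300:
        return "ok"
    if status in (429, 420):
        return "slow_down"             # being rate limited
    if status == 503:
        return "back_off"
    if status in (500, 502, 504):
        return "back_off_if_frequent"  # a few are normal; many means back off
    if status == 410:
        return "drop_url"              # stop requesting this URL
    if status in (400, 405):
        return "probable_bug"          # likely a bug in the crawler
    if 400 <= status < 600:
        return "flag_for_review"       # unrecognized error: log and examine
    return "other"                     # e.g. 1XX and 3XX, handled elsewhere
```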

(2) is a great idea I hadn't considered. A surprising number of sites require "browser" user-agents but otherwise have well-defined rate limits, robots.txt files, and everything you'd need to write a respectful crawler.

I'm not sure that (4) matters for larger sites? Their rate limits are usually a drop in the bucket compared to the background traffic.

#4 was more to avoid being noticed by someone like me before they’ve had their morning coffee. That being said, if anything does go wrong, and you’ve ramped up slowly, at least it gives autoscaling time to respond.

Generally, though, unless you screw up badly, submit forms, or blend in with a more problematic crawler, nobody’s going to care (or even notice).

The web (pages linked by URLs) is not a DAG, so it can have loops. Regardless, even though I've never designed a web crawler, I'd think a basic feature would be deduplication: a database (table) of visited URLs with a timestamp, so the crawler can check this table before visiting a URL. The timestamp lets you revisit after X days to check for changes, and the refresh rate can also be stored in the table per URL.
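
A minimal sketch of that idea, with an in-memory dict standing in for the database table. The names `VisitedTable`, `crawl`, and `get_links` are hypothetical, and a single global refresh interval is used instead of a per-URL one to keep it short.

```python
import time
from collections import deque


class VisitedTable:
    """In-memory stand-in for a table of visited URLs with timestamps."""

    def __init__(self, refresh_after_s, clock=time.time):
        self.refresh_after_s = refresh_after_s
        self._clock = clock
        self._last_visit = {}  # url -> timestamp of the last visit

    def should_visit(self, url):
        """True if the URL was never visited or its last visit is stale."""
        last = self._last_visit.get(url)
        return last is None or self._clock() - last >= self.refresh_after_s

    def mark_visited(self, url):
        self._last_visit[url] = self._clock()


def crawl(start_url, get_links, table):
    """Breadth-first crawl that checks the table before every visit.

    get_links is a caller-supplied function returning the URLs linked
    from a page. Returns the URLs in the order they were visited.
    """
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        if not table.should_visit(url):
            continue  # already seen recently: this is what breaks link loops
        table.mark_visited(url)
        visited_order.append(url)
        queue.extend(get_links(url))
    return visited_order
```

With pages that link back to each other, the crawl terminates because each URL is visited at most once per refresh interval.
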
Trust me, that's not the first thing you think about when designing your scraper.

Typically, one doesn't care whether the same page has been visited before. What one does care about is avoiding storing duplicate data.

> a basic feature would be deduplication
> Which ones are actual attacks, and which are just poorly-made crawlers?

If it walks like an attack and it quacks like an attack...

I am lacking in sympathy for the perp here, as being careless like this has probably caused problems and possibly cost significant money for a lot of people.

However, this is also a compelling demonstration of why cloud services should be required to provide a hard price cap option for safety reasons. I've heard all the self-serving arguments they make about how turning things off surprisingly might be unwanted behaviour and so on. If that's the case, the admin won't set a cap. But there are exactly zero circumstances under which someone who intended to cap their usage at a level that would cost single digits of dollars or remain within a free plan intends or wants to run something that costs four orders of magnitude more than that, and IMNSHO such predatory pricing models should be illegal (assuming that the charges aren't already considered unenforceable by courts under such circumstances; I haven't checked).

It is highly likely that this person caused a $72,000 bill across all the websites he crawled. It's just that the cost is spread out over multiple websites, so it is not noticeable.
I don't think so, serving a page from cache is far cheaper than requesting, crawling, and storing that page in a database. Cloud comes with a premium, too.
While that’s true, not everything can be cached, and many websites run expensive code to assign a session to each new “user.” Larger sites generally learn to avoid that or have the infrastructure to accommodate it, but even moderate-sized blogs and forums probably can’t cope with that scenario all too well.
Maybe it would work to put a marker argument (like the IP address as base64) in the URL when there might be snowballing traffic so you can see if it comes back at you. That could be used to send a page with all the links taken out, or just be rate limited.
Tricks like that don’t work with sites that are receiving a lot of traffic. Also, the exact solution you’ve described is a liability—IP addresses leak when people send each other links, and having unique URLs like that can cause issues with caching. Sure, we could store tokens in a database, but then you’ve just moved the bottleneck to the database.

We do have various ways to combat these issues; like any website of sufficient size, we have pretty complex methods of detecting problematic traffic and assessing the risk of any given request or session. However, no solution is perfect, and with the number of broken crawlers we see, some will inevitably cause problems.

To be clear, we can adjust our code and block them—that’s not an issue. The issue is that I have to wake up at 3 AM to do it, and even if it’s blocked, dealing with that traffic can be expensive. This guy got his $72k bill forgiven, but don’t expect the websites on the other end to be so lucky. (Yes, yes, ingress bandwidth is often free, but it’s never that simple. Scaling up? Bezos takes a cut. More database traffic? Pay the Bezos tax. Replication of enormous logs to other providers? Bezos hungry!)

Negligence is negligence. If you get in a car and drive recklessly without proper training, even if you didn’t intend to hurt anyone, you’re not going to get a lot of sympathy when you mow down a pedestrian. Likewise, I have little sympathy for people who face enormous bills for abusing powerful tools.

That’s not to say cloud providers don’t have billing problems. The delays are unacceptable, and the budgeting tools are often unintuitive or, as was likely the case here, outright inadequate. But in no universe was deploying code that spun up a container for every URL encountered a good idea.

Should such a mistake result in a $72k bill? Eh, probably not. I doubt this person will make the same mistake again, even with the bill forgiven. Or maybe they’ll just blame Google and attempt the same thing on AWS.

I would think you could obscure whatever marker you use fairly easily, any basic encryption should work. It mostly seems like you could do something that temporarily throttles crawlers to a limit that doesn't affect humans much so you don't have to do something manual in the middle of the night. Statistical outliers that get limited to one page request per second per IP or something like that.

The rest of this is arguing against something I'm not saying, which is fine, but thinking about a solution is not condoning the problem.

> I would think you could obscure whatever marker you use fairly easily, any basic encryption should work.

Indeed, you can, and there are situations in which it makes sense. However, it doesn’t really help when it comes to detecting abuse of this sort. For one, CGNAT causes problems. There’s also the issue of people linking to articles from sites like HN and Wayback Machine. Those two alone make it nearly impossible to automatically rate limit based on an ID in the URL.

CGNAT is a big issue that Western companies tend to neglect. However, it’s increasingly common in places like India, and it’s even seen at times in the US, especially in rural areas.

And, of course, public VPNs are growing in popularity.

Unfortunately, all of these factors mean that performing any sort of risk analysis or rate limiting on IP address alone tends to be ineffective or outright harmful for moderately large sites. You can do some fairly basic categorization (this is from a residential ISP, this is from a datacenter), but beyond that, it’s not particularly useful.

Hypothetically, let’s say:

1. We tag every URL with an IP address association in some way.

2. Someone posts a link on HN.

3. We see lots of requests with IP address tags that don’t match the actual requesting IP address, so we block or rate limit them.

4. We’ve just blocked traffic from HN.

Another hypothetical:

1. We design, calibrate, and test a rate limiting system in the US.

2. Some large percentage of real-world traffic comes from India and is behind CGNAT.

3. We’ve just rate-limited most of India.

4. So we exclude India.

5. But now we’ve rate-limited Nigeria, and malicious traffic from India isn’t blocked.

What we actually end up doing is similar but mostly relies on cookies instead, and it’s only a single risk factor. It’s not perfect, and it has some caveats that the URL solution avoids, but it has far fewer false positives.

I have solved this kind of attack with Redis plus an app modification that counts requests per IP per minute, automatically adds iptables rules to ban the offender's IP, and unbans it after xx minutes. The iptables rules are then synchronized to my fleet of front-end servers.
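
The counting part of that approach might look something like the sketch below. It uses plain dicts in place of Redis and returns a boolean instead of touching iptables, so the class name, parameters, and storage are all assumptions made for illustration.

```python
class PerIpLimiter:
    """Count requests per IP per minute and temporarily ban offenders.

    In-memory sketch: a Redis-backed version would use expiring keys
    rather than a dict that grows without bound.
    """

    def __init__(self, max_per_minute, ban_seconds):
        self.max_per_minute = max_per_minute
        self.ban_seconds = ban_seconds
        self._counts = {}        # (ip, minute bucket) -> request count
        self._banned_until = {}  # ip -> timestamp when the ban expires

    def allow(self, ip, now):
        """Return True if the request should be served, False if banned."""
        if self._banned_until.get(ip, 0) > now:
            return False
        self._banned_until.pop(ip, None)  # ban (if any) has expired

        key = (ip, int(now // 60))
        self._counts[key] = self._counts.get(key, 0) + 1
        if self._counts[key] > self.max_per_minute:
            # This is the point where a firewall rule would be added.
            self._banned_until[ip] = now + self.ban_seconds
            return False
        return True
```

Anything keyed on IP alone inherits the usual caveats: shared addresses get punished together, and a client that rotates addresses is never counted twice.
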

I noticed Cloudflare is doing the same but 1 level deeper with XDP drop: https://blog.cloudflare.com/how-to-drop-10-million-packets/

This person used a new IP address for every single request, so that won’t work. And that’s a growing trend.
Yep I can come at someone with datacenter/residential/mobile/etc. IP addresses, all incredibly configurable to slip around network blocks. Luminati and Proxy Bonanza are the services I've used with the most success.

Getting source proxy lists of high-reputation networks is just $$ and a simple API integration game anymore.

There’s also the issue of CGNAT. If you rate limit too strictly based on IP address, you harm users who are stuck with CGNAT, especially in Asia and Africa. India is particularly problematic.

As for stuff like Luminati, if you’re being sufficiently sneaky, chances are you’re not going to snowball in the first place. I’m not sure why anyone would bother paying for Luminati to crawl sites like the one for which I work, but I have seen people use it to scam.

We can’t really be bothered to waste resources blocking well-behaved crawlers. Just keep it at a reasonable pace, respect errors (especially 429, but also 410 and 503), and ensure we have a way to contact you if things go wrong.

Yep - I have the HTTP error code detection dialed into an extreme because it's dumb to run a broken scrape anyway.

Frankly just any errors - if I see more than say 5-10 jobs fail within a 2-3 minute time period things are designed to wait X time, try again... and stop if they're still encountering errors and ping me to come in and investigate.
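A minimal sketch of that kind of stop-on-errors logic, assuming a sliding window of failure timestamps. The class and function names, and the three action labels, are hypothetical; the thresholds are whatever fits the job.

```python
from collections import deque


class FailureWindow:
    """Track failures inside a sliding time window."""

    def __init__(self, max_failures, window_s):
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # timestamps of recent failures

    def record_failure(self, now):
        self._failures.append(now)
        self._trim(now)

    def tripped(self, now):
        """True if more than max_failures occurred within the window."""
        self._trim(now)
        return len(self._failures) > self.max_failures

    def _trim(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()


def next_action(window, now, already_paused):
    """Decide what the scrape should do next.

    First trip: pause and retry later. Still failing after a pause:
    stop and alert a human instead of retrying forever.
    """
    if not window.tripped(now):
        return "continue"
    return "stop_and_alert" if already_paused else "pause"
```

The important property is that retries are bounded: a second trip ends the run rather than scheduling yet another attempt.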

Faulty retry logic is just as dangerous as the forked/distributed run-off situation.

I'd assume they use 1000 distinct IPs because the system scaled to 1000 instances and presumably many of them came from related IPs. So it makes IP banning considerably harder, but not impossible.
More importantly, any reasonable IP-based approach would have a lot of false positives. What if there’s an RSS service like Feedly running on the same cloud?
Depending on your product, blacklisting all AWS IPs might be acceptable. For example, my company has a VPN exit on AWS, which appears to be blacklisted by Twitter.