> The ex-Googler reflected that he missed the possibility of pages that link back to each other, causing "infinite recursion." Although tangential to the billing issue, this is reckless. If you’re building a crawler of any kind, please, please, please prioritize ensuring this doesn’t happen so I don’t have to wake up at 3 AM. I run the infrastructure for a moderate-sized site with probably about a hundred million pages or so. We can handle the HN hug-of-death just fine. But poorly-made crawlers that recurse like this? They’re increasingly problematic. If your solution to fixing your crawler is “throw more concurrency at it and ignore the recursion,” and suddenly your requests start timing out, that’s a pretty damn strong hint that you’re ruining someone’s day. From my perspective, this will look like an attack. I’ll see thousands of IP addresses repeatedly requesting the same pages, usually with generic user agent headers. Which ones are actual attacks, and which are just poorly-made crawlers? Well, if you’ve got a generic user agent string that doesn’t link to a contact page, and you’re circumventing rate limiting by changing your IP address, and you had the bright idea to let your test code run overnight, I’m going to treat it as an attack. At 3 AM, I’m not inclined to differentiate between negligence and malice. This is happening more and more often, and I partially blame it on the ease of “accidentally” obtaining a ridiculous quantity of cloud resources. People deploy shoddy test code and go to bed. They turn it off in the morning when they see the bill. It’s become so prevalent that our company has come up with an internal term for these crawlers that spin up a new thread/container for every page: snowballing crawlers. Save a sysadmin: don’t snowball. Oh, and include a useful user agent header so we can contact you instead of your cloud provider.
Puppeteer etc. are nice and all, but if you can get away with raw HTTP requests that grab and parse the HTML without pulling down stylesheets, JS, etc., do it. It is WAY more efficient than requesting the full user-experience overhead from these folks, and threading out 5-10 workers to gracefully crawl a site this way doesn't typically cause things to melt down on your target's end.
You may be saying "well, I need a browser stack or evaluated JS to do my work" and you may be right... but honestly, 90% of this stuff is reverse-engineerable with Charles Proxy and some basic webdev experience. Heck - I've even sandboxed JS from a target's site to generate tokens etc. to cut down on repeat requests. Even CAPTCHA stuff can easily be done without having to pull down the full UI/UX overhead these days.
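To make the raw-HTTP approach a bit more concrete, here is a minimal sketch in Python of pulling links out of an HTML document with only the standard library. It assumes you already have the page body as a string (how you fetch it is up to you), and the names `LinkParser` and `extract_links` are made up for illustration. Only `<a href>` values are collected, so stylesheets, scripts, and images are never requested.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags; other tags are ignored."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop #fragments so the same
                # page is not treated as several different URLs.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return absolute link targets found in `html`, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Something like `extract_links(body, "https://example.com/")` gives you the next URLs to consider without ever touching the page's assets. Whether this is enough depends on the site; pages that build their links in JS will need more work.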
---
"Save a sysadmin: don’t snowball."
Implement thread limits, rate limiting, throttling, and intelligent caching, and try to fit within your target's hosting capabilities without being disrespectful. Often I will "smear" large jobs over weeks' worth of time so that it's only a trickle of traffic here and there (and also to fly under the radar... sorry).
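A rough sketch of what "don't snowball" can look like in Python: a fixed worker pool instead of a thread per page, a shared rate limiter, a visited set so pages that link back to each other are fetched only once, and a hard page cap. Everything here is an assumption for illustration: `fetch_links` is a hypothetical callable you supply that returns the links found on a URL, and the defaults (5 workers, one request per second, 1000 pages) are placeholders, not recommendations for any particular site.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Spaces calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def crawl(start_url, fetch_links, max_workers=5, min_interval=1.0, max_pages=1000):
    """Breadth-first crawl that requests each URL at most once.

    `fetch_links(url)` is a hypothetical caller-supplied function that
    returns an iterable of URLs found on that page.
    """
    limiter = RateLimiter(min_interval)
    visited = {start_url}   # every URL ever queued; breaks link cycles
    frontier = [start_url]  # URLs to fetch in the current round

    def polite_fetch(url):
        limiter.wait()
        return list(fetch_links(url))

    # One fixed-size pool for the whole job: concurrency never grows
    # with the number of pages discovered.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier = []
            for links in pool.map(polite_fetch, frontier):
                for link in links:
                    if link not in visited and len(visited) < max_pages:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited
```

The visited set is updated only in the main thread, so it needs no locking. "Smearing" a job is then mostly a matter of raising `min_interval`: 100,000 pages over two weeks works out to roughly one request every 12 seconds.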
Also - on the custom user agent string (UAS): don't take this advice unless you're trying to make it easy to get blocked/identified. Let's face it - this is a gray area for most. The best way is to not "snowball" and to make your scrapers indistinguishable from a reasonable stream of real users from real networks. I would never expect a sysadmin to contact me because, frankly, they aren't paid to.
---
One last thought - the people who are out there writing these bots/crawlers/etc. are often the lowest common denominator. They're the type that will get something "working" and hurry on to the next job, because the nature of the work tends to be a ton of low-paid contract stuff. Also, at almost every ecommerce place I've worked that involved scraping, it's the bottom-rung dev talent that gets assigned to the work.
Sucks, but I attribute near-100% of your "snowball" situation to that.