by peterangular 2021 days ago
Also - as someone with a ton of experience on the other side of this coin:

Puppeteer etc. are nice and all but if you can get away with raw HTTP requests grabbing and parsing the HTML without pulling down stylesheets, JS, etc. do it. It is WAY more efficient than pulling down the full user-experience overhead from these folks, and threading out 5-10 workers to gracefully crawl a site this way doesn't typically cause things to melt down on your target's end.
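A rough sketch of what the raw-HTTP approach could look like, using only the Python standard library. The names `LinkParser`, `extract_links`, `fetch_html`, and `crawl` are made up for illustration, and the pool size of 5 is just the low end of the 5-10 range mentioned above. It fetches the HTML document only and never requests stylesheets, scripts, or images.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags; nothing is rendered or executed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def fetch_html(url, timeout=10):
    """One plain GET for the HTML document only (no CSS, JS, or images)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(urls, fetch=fetch_html, workers=5):
    """Fetch and parse each URL with a small, fixed pool of workers."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, urls)
        return {url: extract_links(html) for url, html in zip(urls, pages)}
```

The `fetch` parameter is there so the parsing and pooling can be exercised with a stub instead of hitting a real site.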

You may be saying "well I need a browser-stack or evaluated JS to do my work" and you may be right... but honestly, 90% of this stuff is reverse-engineerable with Charles Proxy and some basic webdev experience. Heck - I've even sandboxed JS from a target's site to generate tokens etc. to cut down on repeat requests. Even CAPTCHA stuff can easily be done without having to pull down the full UI/UX overhead these days.

---

"Save a sysadmin: don’t snowball."

Implement thread limits, rate limiting, throttling, intelligent caching, and try to fit within your target's hosting capabilities without being disrespectful. Often I will "smear" large jobs over weeks' worth of time so that it's only a trickle of traffic here and there (and to also fly under the radar... sorry).
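For illustration, here is one minimal way throttling and "smearing" might be sketched in Python. `Throttle` and `smear_interval` are hypothetical names, and an even spread over the whole period is only the simplest possible assumption; caching and thread limits are not covered here.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive requests to one target."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def smear_interval(total_requests, days):
    """Seconds between requests if a job is spread evenly over `days`."""
    return days * 86400 / total_requests
```

Calling `throttle.wait()` before every request keeps a single worker under the chosen rate; something like `Throttle(smear_interval(100_000, 14))` would stretch 100,000 requests over two weeks at roughly one every 12 seconds.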

Also - on the custom UAS: Unless you're trying to make it easy to get blocked/identified then don't take this advice. Let's face it - this is a gray area for most. The best way is to not "snowball" and to make your scrapers indistinguishable from a reasonable stream of real users from real networks. I would never expect a sysadmin to contact me because frankly they aren't paid to.

---

One last thought - the people who are out there writing these bots/crawlers/etc. are often the lowest common denominator. They're the type that will get something "working" and hurry onto the next job because the nature of the work tends to be a ton of low-paid contract stuff. Also, at almost every place I've worked in ecommerce that has scraping involved, it's the bottom-rung dev talent that's assigned to the work.

Sucks, but near-100% I attribute your "snowball" situation to that.


> Also - on the custom UAS: Unless you're trying to make it easy to get blocked/identified then don't take this advice.

I can’t speak for other sites, but we’re pretty good at picking up on crawlers that don’t have a unique UA. The problem is that we’re going to have a hard time differentiating your well-behaved crawler from more malicious crawlers, and you’re going to get caught in the crossfire.

> if you can get away with raw HTTP requests grabbing and parsing the HTML without pulling down stylesheets, JS, etc. do it.

If you combine that with the lack of an identifying UA, there’s unfortunately a good chance you’ll get caught in the crossfire during an actual attack. That being said, it’s good advice otherwise. If you’re trying not to be identified as a crawler, it’s really going to stand out, though.

> I would never expect a sysadmin to contact me because frankly they aren't paid to.

I am. Furthermore, as long as you’re being transparent about your activity (see: UA), I don’t mind working with you instead of your provider. I understand that writing good crawlers is a learning experience; mistakes do happen. When I send abuse reports, usually people just get a slap on the wrist, but not everyone is that lucky.

But, if your UA has contact info, I can:

1. Easily rate limit or block you until the issue is resolved

2. Contact you directly, explaining exactly what’s wrong

3. Easily unblock you once it’s fixed

Sure, I’m not going to be happy about it, but I’m going to be a lot happier than if you try to blend in—a situation in which I’m not going to have any sympathy.

Unfortunately, most sites don’t respond that way and would rather just block anything remotely suspicious. But since you can always change your IP address, maybe try with an identifiable UA first—please? :)
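As a sketch of what an identifiable UA with contact info could look like: the `name/version (+info URL; email)` layout below is a common convention among crawlers rather than a requirement, and the bot name, URL, and address are placeholders, not real endpoints. `build_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_request(url, bot_name, version, info_url, contact_email):
    """Build a GET request whose User-Agent names the bot and how to reach its owner."""
    user_agent = f"{bot_name}/{version} (+{info_url}; {contact_email})"
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder values for illustration only.
req = build_request(
    "https://example.com/page",
    bot_name="examplebot",
    version="0.1",
    info_url="https://example.com/bot",
    contact_email="crawler-admin@example.com",
)
```

With something like this in the logs, a sysadmin has enough to rate limit by UA and to send an email instead of an abuse report.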

Edit: Also, a few recommendations to add:

1. Be prepared to handle obscure HTTP status codes. 503 indicates you need to back off. Frequent 500, 502, or 504 means the same thing. 429 and 420 mean you’re being rate limited; slow down. 410 means you should stop requesting the given URL. 400 or 405 means you probably have a bug. Any unrecognized 4XX or 5XX error should be flagged and examined so you can handle it better in the future.

2. You can send an X-Abuse-Info header and a generic UA if you want capable sysadmins to be able to identify you but want to avoid being blocked by inexperienced webmasters.

3. Don’t ignore abuse reports.

4. Try to be consistent and ramp up slowly. It’s harder to cope with unnaturally-abrupt increases in traffic.
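To make points 1 and 2 concrete, here is a hedged sketch in Python. The `Action` names, `classify_status`, and `polite_headers` are invented for this example. The mapping follows the list above, except that it treats every 500, 502, or 504 as a back-off signal instead of counting how frequent they are, which is a simplification. The `X-Abuse-Info` header name comes from the list, but the format of its value is an assumption.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    BACK_OFF = "back_off"      # server is struggling: pause for a while
    SLOW_DOWN = "slow_down"    # rate limited: lower the request rate
    DROP_URL = "drop_url"      # stop requesting this URL
    LIKELY_BUG = "likely_bug"  # the crawler probably sent a bad request
    FLAG = "flag"              # unrecognized error: log it for review


def classify_status(status):
    """Map an HTTP status code to what the crawler should do next."""
    if status < 400:
        return Action.OK
    if status in (500, 502, 503, 504):
        return Action.BACK_OFF
    if status in (420, 429):  # 420 is nonstandard but seen in the wild
        return Action.SLOW_DOWN
    if status == 410:
        return Action.DROP_URL
    if status in (400, 405):
        return Action.LIKELY_BUG
    return Action.FLAG


def polite_headers(generic_ua, abuse_contact):
    """Generic browser-like UA plus a separate header saying who to contact."""
    return {"User-Agent": generic_ua, "X-Abuse-Info": abuse_contact}
```

Anything that comes back as `Action.FLAG` is worth logging along with the URL and response body, so the mapping can be extended later.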

(2) is a great idea I hadn't considered. A surprising number of sites require "browser" user-agents but otherwise have well-defined rate limits, robots.txt files, and everything you'd need to write a respectful crawler.

I'm not sure that (4) matters for larger sites? Their rate limits are usually a drop in the bucket compared to the background traffic.

#4 was more to avoid being noticed by someone like me before they’ve had their morning coffee. That being said, if anything does go wrong, and you’ve ramped up slowly, at least it gives autoscaling time to respond.
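One simple way to ramp up slowly is a linear schedule of request rates, sketched below. `ramp_schedule` is a hypothetical helper, and the linear shape and step count are assumptions; nothing above prescribes a particular curve.

```python
def ramp_schedule(start_rps, target_rps, steps):
    """Request rates rising linearly from start_rps to target_rps, inclusive."""
    if steps < 2:
        return [float(target_rps)]
    delta = (target_rps - start_rps) / (steps - 1)
    return [start_rps + i * delta for i in range(steps)]


# Five stages from 1 to 5 requests per second.
rates = ramp_schedule(1, 5, 5)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Holding each stage for a fixed period (say, several minutes) before moving to the next gives autoscaling on the other end time to react.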

Generally, though, unless you screw up badly, submit forms, or blend in with a more problematic crawler, nobody’s going to care (or even notice).