Hacker News
by bndr 118 days ago
I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar services, just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of successfully crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Otherwise I would get a 403 immediately, with no HTML being served.

5 comments

Very interesting!

Yes, in this day and age, I could definitely see web pages being harder to crawl by search engines (and SEO companies and other users of automated web crawling technologies (AI agents?)) than they were in the early days of the Internet due to many possible causes -- many of which you've excellently described!

In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like piece of software -- SEO, AI Agent, etc., etc.) than there was in the early days of the Internet, where everything was straight unencrypted http and most URLs were easily accessible without having to jump through additional hoops...

Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have ROBOTS.TXT for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes, thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...
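For anyone who has not used it, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `is_allowed` helper are all made up for illustration; a real crawler would download the file from the site instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
# Parsing from a string keeps the example free of network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/products/widget",
        "https://example.com/private/report",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that robots.txt is only advisory: it tells polite bots what the owner prefers, which is a separate mechanism from a server actively returning 403.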

Thus there may be a market for testing public facing business/product websites against faulty "I can't give you that web page for whatever reason" error codes from a wide variety of clients, from a wide variety of locations around the world.

That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...
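A rough sketch of what such a check could look like: probe one URL from several regions with several client user agents and collect the combinations that get refused. Everything here is hypothetical, including the `find_refusals` name, the set of status codes counted as a refusal, and the `fetch_status` callable, which stands in for whatever HTTP client and probe locations a real service would use.

```python
from typing import Callable, Iterable, List, Tuple

# Status codes treated as "the page was refused" (an assumption of this sketch).
BLOCKED_STATUSES = {401, 403, 429, 503}

# fetch_status(url, user_agent, region) -> HTTP status code.
# Injected so the sketch does not depend on any particular HTTP library.
FetchStatus = Callable[[str, str, str], int]


def find_refusals(
    url: str,
    user_agents: Iterable[str],
    regions: Iterable[str],
    fetch_status: FetchStatus,
) -> List[Tuple[str, str, int]]:
    """Return (region, user_agent, status) for every refused probe."""
    agents = list(user_agents)
    refusals = []
    for region in regions:
        for agent in agents:
            status = fetch_status(url, agent, region)
            if status in BLOCKED_STATUSES:
                refusals.append((region, agent, status))
    return refusals


if __name__ == "__main__":
    # Fake fetcher for demonstration: refuses one made-up bot everywhere.
    def fake_fetch(url: str, agent: str, region: str) -> int:
        return 403 if agent == "ExampleBot/1.0" else 200

    print(find_refusals(
        "https://example.com/product",
        ["Mozilla/5.0", "ExampleBot/1.0"],
        ["eu-west", "us-east"],
        fake_fetch,
    ))
```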

So, a very interesting post!

Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today vs. Tim Berners-Lee's original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!

(Things have changed... a bit! :-) )

Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!

I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.
In Italy it's a crime punishable by up to 12 years to access any protected computer system without authorization, especially if it causes a DoS for the owner.

Consider the case of self-hosting a web service on a low-performance server while abusive crawlers fetch data in a loop (which was happening when I was self-hosting GitLab!)

https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...

Can't your users just whitelist your IPs?
I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.

And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.
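On the customer side, the allowlist rule itself is conceptually tiny, which is why a changing IP is so painful: the check is an exact match against addresses or ranges. A minimal sketch with Python's standard `ipaddress` module follows; the `is_allowlisted` helper and the documentation-range addresses are made up, and real Cloudflare or Akamai rules are configured in their own dashboards rather than in code like this.

```python
import ipaddress
from typing import Iterable


def is_allowlisted(client_ip: str, allowlist: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowlisted address or CIDR range."""
    ip = ipaddress.ip_address(client_ip)
    # ip_network accepts a bare address too and treats it as a single-host network.
    return any(ip in ipaddress.ip_network(entry) for entry in allowlist)


if __name__ == "__main__":
    # Hypothetical crawler addresses from the reserved documentation ranges.
    allowlist = ["203.0.113.0/28", "198.51.100.7"]
    print(is_allowlisted("203.0.113.5", allowlist))    # inside the /28
    print(is_allowlisted("203.0.113.200", allowlist))  # outside every entry
```

If the crawler's provider reassigns it to an address outside those entries, the check fails and every client has to update their rule.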

They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
Would it make sense to offer the more technically minded a discount if they set up an IP whitelist, using a tutorial you could provide? A discount in exchange for reduced costs to you?
The right solution is to be registered at Cloudflare, but then getting the customer to reach the guy who handles Cloudflare settings (a few clicks) is the hard part.
Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.
Just stop scraping. I'll do everything to block you.
> in my case, users add their own domains

Seems like they're only scraping websites their clients specifically ask them to

Now you've gamified it :)
It's a pretty easy game to win as the blocker. If you receive too many 404s for pages that don't exist, just ban the IP for a month. I actually got the idea from a Hacker News comment too. I'm also thinking that if you crawl too many pages, you should get banned as well.
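A minimal sketch of that 404-based ban, assuming an in-memory store and a plain per-IP counter. The `NotFoundBanList` class, the threshold of 20, and the 30-day duration are all made-up values, since the comment only says "too many" and "a month"; a real setup would more likely live in the web server or firewall configuration.

```python
from collections import defaultdict
from typing import Dict

THIRTY_DAYS = 30 * 24 * 3600  # "a month", approximated in seconds


class NotFoundBanList:
    """Ban an IP once it has triggered more than max_404s responses with status 404."""

    def __init__(self, max_404s: int = 20, ban_seconds: float = THIRTY_DAYS) -> None:
        self.max_404s = max_404s          # hypothetical threshold
        self.ban_seconds = ban_seconds
        self._counts: Dict[str, int] = defaultdict(int)
        self._banned_until: Dict[str, float] = {}

    def is_banned(self, ip: str, now: float) -> bool:
        return now < self._banned_until.get(ip, 0.0)

    def record(self, ip: str, status: int, now: float) -> None:
        """Call once per response served to ip."""
        if status != 404:
            return
        self._counts[ip] += 1
        if self._counts[ip] > self.max_404s:
            self._banned_until[ip] = now + self.ban_seconds
            self._counts[ip] = 0  # start counting afresh after the ban


if __name__ == "__main__":
    bans = NotFoundBanList(max_404s=3)
    for _ in range(4):
        bans.record("203.0.113.9", 404, now=0.0)
    print(bans.is_banned("203.0.113.9", now=1.0))
```

This version never resets the counter on its own, so a long-lived visitor who hits the occasional dead link would eventually be banned; a time window on the count would be the obvious refinement.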

There's no point in playing tug of war against unethical actors, just ban them and be done with it.

I don't think this is an uncommon stance, nor are crawlers users I want to help in any capacity.

So you're blocking the absolute bottom of the barrel scrapers and feel like you 'won' because you don't even notice any scraper that isn't complete trash.

Then again why block them if they don't cause any issue in the first place? Instead of going ballistic on IPs that you don't vibe with you could also just do proper rate limiting.
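For reference, "proper rate limiting" often means something like a per-client token bucket: each IP gets a small burst allowance that refills at a steady rate, and requests beyond it are rejected (typically with a 429) instead of the IP being banned outright. A minimal in-memory sketch follows; the `TokenBucketLimiter` class and its parameters are made up for illustration, not any particular server's implementation.

```python
from typing import Dict, Tuple


class TokenBucketLimiter:
    """Per-IP token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        # ip -> (tokens remaining, timestamp of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str, now: float) -> bool:
        """Return True and spend one token if ip may make a request at time now."""
        tokens, last = self._state.get(ip, (self.capacity, now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[ip] = (tokens - 1.0, now)
            return True
        self._state[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucketLimiter(rate=1.0, capacity=2.0)
    # Three requests in the same instant: the burst allowance covers only two.
    print([limiter.allow("203.0.113.9", now=0.0) for _ in range(3)])
```

A throttled client recovers on its own as tokens refill, which is the main difference from a month-long ban.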

If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.
What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?
He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.