| HN Mirror

	by 0x59 23 days ago
	Idk, if bots are hammering your server then set up rate limits. If you have content that you don't want others to have access to, don't serve it with a webserver.
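For anyone wondering what "rate limits" could look like in practice, here is a minimal sketch of a per-IP token bucket. The class name, the `rate` and `burst` parameters, and the injectable `clock` are all assumptions made for illustration; a real deployment would more likely rely on the limiter built into the web server or reverse proxy.

```python
import time


class TokenBucketLimiter:
    """Per-client token bucket: `rate` tokens refill per second, capped at `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client_ip -> (tokens_left, time_of_last_update)

    def allow(self, client_ip):
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(client_ip, (self.burst, now))
        # Refill for the elapsed time, never exceeding the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_ip] = (tokens - 1, now)
            return True
        self.buckets[client_ip] = (tokens, now)
        return False
```

A request handler would call `allow(ip)` and answer with HTTP 429 when it returns `False`. Note that this keys on the IP address alone, so it does nothing against clients that switch address on every request.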

6 comments

TkTech 23 days ago

I used to just start giving any IP downloading way too much a redirect to multi-TB NASA images. This was a long time ago, but it was surprising how many would follow redirects and never time out. I wouldn't see a request again for hours, and then it's right back to downloading a new part of the sky.

Those images also used to crash all the early GUI IRC and chat clients that showed inline images without size checks...

mcosta 23 days ago

How do you know it followed the redirect and downloaded the image?

timbit42 22 days ago

Because it didn't come back for hours.

dotancohen 23 days ago

How were you tracking each IP address's data usage? Did you parse the logs every request? Store usage in a database? At the application or webserver level?

TkTech 23 days ago

Webalizer! I'm not sure there were really any other options at the time other than writing your own. It parsed the Apache logs and gave you pretty detailed results, and you could see the usage (in KB, which tells you how long ago this was!) broken down by date and IP.

Once you added a redirect rule for the IP to Apache, you'd just check your log and see that the IP that had been hitting you every couple of minutes had poofed for a good few hours.
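As a rough illustration of the per-IP usage breakdown described above, here is a small sketch that sums response bytes per client from access-log lines. It assumes the Common Log Format prefix (host, ident, user, timestamp, quoted request, status, size); the function names and the threshold are made up for the example, and this is not how the original tool worked internally.

```python
import re
from collections import Counter

# Common Log Format prefix: host ident user [time] "request" status bytes
# Lines whose request contains an escaped quote will not match and are skipped.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) (\d+|-)')


def bytes_per_ip(lines):
    """Return a Counter mapping each client address to total bytes served."""
    usage = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not an access-log line we understand
        ip, _status, size = match.groups()
        if size != "-":  # "-" means no response body was sent
            usage[ip] += int(size)
    return usage


def heavy_hitters(lines, threshold_bytes):
    """Addresses whose total transfer exceeds the (hypothetical) threshold."""
    totals = bytes_per_ip(lines)
    return sorted(ip for ip, total in totals.items() if total > threshold_bytes)
```

The list from `heavy_hitters` is what you would then turn into redirect or block rules by hand or with a script.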

dotancohen 22 days ago

Now that's a name I've not heard in a long time.

That's nuts. I suppose you had Webalizer on a minutely cron job. It might have been drawing more resources than Apache itself!

pmdr 23 days ago

This. What even is the point of blocking scrapers if Google consumes your content anyway and serves it as an AI answer?

These are sad times we're living in as far as openness of the web goes. People would have less of a scraping problem if their websites didn't ship with 20MB of JS.

remus 23 days ago

> What even is the point of blocking scrapers if Google consumes your content anyway and serves it as an AI answer?

Googlebot is generally fairly well behaved, but this is not the case for all scrapers, and they can cause significant traffic (and expense).

miki123211 22 days ago

There is something to be said for "one way indexes."

Imagine you run a company register for a local government. You want to let people look up companies by their registration number (which they must disclose in all communications to you) to see if they're legit and whether any warnings have been raised against them. You don't want unscrupulous marketers to just be able to `SELECT * FROM companies WHERE type='nail_salon' AND city='london'`.

If you aren't super strict about scraping, some shadowy business in Neverland, completely unconcerned with following your laws, will build that database.
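To make the "one way index" idea concrete, here is a minimal sketch in which the only operation exposed is an exact lookup by registration number, so callers cannot filter by type or city. The schema, the sample rows, and the function names are all hypothetical and exist only for the illustration.

```python
import sqlite3


def make_register():
    """Build an in-memory register with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE companies ("
        "reg_no TEXT PRIMARY KEY, name TEXT, type TEXT, city TEXT, warnings INTEGER)"
    )
    db.executemany(
        "INSERT INTO companies VALUES (?, ?, ?, ?, ?)",
        [
            ("R-1001", "Example Nails Ltd", "nail_salon", "london", 0),
            ("R-1002", "Sample Plumbing Ltd", "plumber", "leeds", 2),
        ],
    )
    return db


def lookup(db, reg_no):
    """The only public query: exact match on the registration number."""
    row = db.execute(
        "SELECT name, warnings FROM companies WHERE reg_no = ?", (reg_no,)
    ).fetchone()
    if row is None:
        return None
    return {"reg_no": reg_no, "name": row[0], "warnings": row[1]}
```

This only narrows the interface. If registration numbers are sequential or guessable, a scraper can still enumerate them, which is where the strictness about scraping comes in.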

bashkiddie 22 days ago

> Imagine you run a company register for a local government.

Is this data not public for some reason? I think it will not hurt if there are multiple copies spread between public offices and private companies. What really hurts is a private company hammering your webserver for their own profit. They should get their own copy.

0x59 22 days ago

If the purpose of the index is to allow people to look up registration and warnings, probably just serve the list. This is public information and doesn't need to be gated. CSV header could be:

Reg_no, status, no_warnings_last_12m

MartijnHols 22 days ago

Rate limits don’t work if bots rotate IPs from residential blocks on every request.

jeroenhd 22 days ago

I have blocked several Asian countries because their IP ranges kept sending stupid scrapers that repeatedly downloaded the same image with a made-up query, bursting through the basic cache setup. Now a billion or so people can't access my server.

Rate limits didn't work because they kept rotating IP addresses.

I'm pretty sure Turnstile would allow more people through than my current solution, but this was quick and easy. I expect to have to ban more ASNs from other countries in the future but the worst bots are now gone.
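One way to stop a made-up query string from bursting through a cache is to build the cache key only from parameters the server actually understands. This is a sketch of that idea, not what I run; the `ALLOWED_PARAMS` set and the `cache_key` function are assumptions for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical: the only query parameters that change the response.
ALLOWED_PARAMS = {"w", "h"}


def cache_key(url):
    """Drop unknown query parameters and sort the rest, so junk queries share one key."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in ALLOWED_PARAMS
    )
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")
```

With this, `/img/cat.png?x=123` and `/img/cat.png?x=456` map to the same cached entry. It reduces the load per request but does not stop the requests themselves.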

BenjiWiebe 22 days ago

I would LOVE to be able to use rate limits (well actually, since I'm dealing with fraud not scraping, I'd ban the IP).

I can't, because every request comes from a new IP!!!
