Hacker News
by rfurmani 445 days ago
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged-in users, tighten robots.txt, and use Cloudflare to block AI crawlers and bad bots. I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged-in users soon.
2 comments

Has someone made a honeypot for AI yet?

Take all regular papers, change their words or keywords to something outrageous, and watch the AI feed that to users.

This kinda fits, though it's on a personal blog level:

https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...

If there were a non-profit dedicated to this, I would donate.
One thing that worked well for me was layering obstacles

It really sucks that this is the way things are, but what I did was

10 page requests in a minute and you get captcha'd (with a little apology and the option to bypass it by logging in). Asset loads don’t count.

After a captcha pass, 100 requests in an hour gets you auth walled

It’s really shitty but my industry is used to content scraping.

This allows legit users to get what they need. Although maybe my users don’t need prolonged access, ahem.
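
A minimal sketch of the layered limits described above, not the commenter's actual code. The class and method names are hypothetical, the sliding-window approach is an assumption, and so is the reading that the 10th page request in a minute (or the 100th in an hour after a captcha pass) is the one that trips the limit. The login bypass is left out.

```python
import time
from collections import defaultdict, deque

PAGE_LIMIT_PER_MINUTE = 10         # from the comment: triggers a captcha
POST_CAPTCHA_LIMIT_PER_HOUR = 100  # from the comment: triggers the auth wall


class LayeredLimiter:
    """Hypothetical per-client limiter; asset loads are never counted."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.minute_hits = defaultdict(deque)  # client -> page-request times
        self.hour_hits = defaultdict(deque)    # client -> times since captcha pass
        self.passed_captcha = set()

    @staticmethod
    def _prune(hits, now, window):
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= window:
            hits.popleft()

    def check(self, client, is_asset=False):
        """Return 'allow', 'captcha', or 'auth_wall' for one request."""
        if is_asset:
            return "allow"
        now = self.clock()
        if client in self.passed_captcha:
            hits = self.hour_hits[client]
            self._prune(hits, now, 3600)
            hits.append(now)
            if len(hits) >= POST_CAPTCHA_LIMIT_PER_HOUR:
                return "auth_wall"
            return "allow"
        hits = self.minute_hits[client]
        self._prune(hits, now, 60)
        hits.append(now)
        if len(hits) >= PAGE_LIMIT_PER_MINUTE:
            return "captcha"
        return "allow"

    def captcha_passed(self, client):
        # Move the client to the second layer and forget its minute window.
        self.passed_captcha.add(client)
        self.minute_hits.pop(client, None)
```

Here `client` would be whatever key identifies a visitor, for example an IP address. A real deployment would also need to expire idle clients so the dictionaries do not grow without bound.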

What happens if you use the proper rate-limiting status, 429? It can include a next retry time via the Retry-After header [1]. I'm curious what (probably small) fraction would respect it.

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
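
For reference, a rough sketch of what a 429 response with a retry hint could look like using only Python's standard library. The helper name, handler class, and the 60-second default are assumptions for illustration. The Retry-After header is optional on a 429, and whether any crawler honors it is exactly the open question here.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

RETRY_AFTER_SECONDS = 60  # assumed value, not from the thread


def rate_limited_response(retry_after=RETRY_AFTER_SECONDS):
    """Build the status, headers, and body for a 429 response."""
    seconds = int(retry_after)
    headers = {
        # Retry-After takes either a delay in seconds or an HTTP date.
        "Retry-After": str(seconds),
        "Content-Type": "text/plain; charset=utf-8",
    }
    body = f"Too many requests. Retry in {seconds} seconds.\n".encode("utf-8")
    return HTTPStatus.TOO_MANY_REQUESTS, headers, body


class ThrottlingHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with the 429 above."""

    def do_GET(self):
        status, headers, body = rate_limited_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `ThrottlingHandler` to `http.server.HTTPServer` on a port of your choice and look at the response headers with any HTTP client.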

Probably makes sense for a B2B app where you publish status codes as part of the API

Bad actors don’t care and annoying actors would make fun of you for it on twitter

I've wanted to but wasn't sure how to keep track of individuals. What works for you? IP addresses, cookies, something else?
I use IP addy. Users behind CGNAT are already used to getting a captcha the first time around

There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.
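
The risk-score idea above could look roughly like this. It is a toy sketch, not how any captcha vendor actually does it: the class name, the penalty weight, and the choice to key everything on the captcha token are all assumptions.

```python
from collections import defaultdict

IP_CHANGE_PENALTY = 1  # assumed weight; real systems tune this


class TokenRiskTracker:
    """Hypothetical tracker: a captcha token reused from a new IP raises its score."""

    def __init__(self):
        self.last_ip = {}              # captcha token -> last IP seen with it
        self.score = defaultdict(int)  # captcha token -> accumulated risk

    def observe(self, token, ip):
        """Record one request and return the token's current risk score."""
        previous = self.last_ip.get(token)
        if previous is not None and previous != ip:
            self.score[token] += IP_CHANGE_PENALTY
        self.last_ip[token] = ip
        return self.score[token]
```

A caller would compare the returned score against some threshold and re-challenge or block above it. Picking that threshold is the hard part, since mobile users change IPs legitimately.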

> This allows legit users to get what they need.

Of course they could have just used the site directly.

If bots and scrapers respected robots.txt and the ToS, we wouldn’t be here

It sucks!

Or just buy cloudflare :)
What is your website?