| HN Mirror


	by SteveNuts 804 days ago
	This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it. I'm sure the big players like Google would deal with it gracefully.

gtirloni 804 days ago

Here you go (1 req/min, 10 bytes/sec), please report results :)

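  http {
    # limit_req is not allowed inside "if", so key the zone on a mapped
    # variable instead: requests with an empty key are not rate limited.
    map $http_user_agent $mimo_limit_key {
      default "";
      "mimo"  $binary_remote_addr;
    }
    limit_req_zone $mimo_limit_key zone=mimo_one_per_minute:10m rate=1r/m;
    server {
      location / {
        limit_req zone=mimo_one_per_minute burst=5;
        # limit_rate is valid in "if in location"; the value is bytes per second.
        if ($http_user_agent = "mimo") {
          limit_rate 10;
        }
      }
    }
  }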

beau_g 804 days ago

Scrapers of the future won't be ifElse logic, they will be LLM agents themselves. The slow loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible. "OK I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."

iosguyryan 804 days ago

Can I interest you in a scrape-share with Claude?

reasonabl_human 804 days ago

Solid use case for Saul Goodman LLM alignment

throw_a_grenade 804 days ago

You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.
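
For the "buffers" half of that, here is a minimal sketch of a size-capped read, assuming the crawler can get at the response as a file-like binary stream. The names `read_capped` and `ResponseTooLarge` are made up for illustration and are not from any library.

```python
import io


class ResponseTooLarge(Exception):
    """Raised when a body exceeds the configured byte budget."""


def read_capped(stream, max_bytes, chunk_size=4096):
    """Read a binary stream to EOF, refusing to buffer more than max_bytes.

    `stream` is any object with a file-like read(n) method (an assumption).
    """
    buf = bytearray()
    while True:
        # Ask for at most one byte beyond the budget so overflow is detectable
        # without ever buffering much more than max_bytes.
        want = min(chunk_size, max_bytes - len(buf) + 1)
        chunk = stream.read(want)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ResponseTooLarge(f"body exceeds {max_bytes} bytes")


if __name__ == "__main__":
    print(len(read_capped(io.BytesIO(b"x" * 10), max_bytes=10)))
```

This only bounds memory in the layer you control. If a library underneath buffers the whole body before handing it over, the cap arrives too late, which is the point about needing to know every layer.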

starttoaster 804 days ago

Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.

Calzifer 804 days ago

Those are usually connection and no-data timeouts. A total time limit is in my experience less common.
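
To make the distinction concrete, here is a rough sketch of a total time budget layered on top of chunked reads. It assumes the body arrives as an iterable of byte chunks; `read_with_deadline` and `DeadlineExceeded` are hypothetical names, and the `clock` parameter exists only so the behaviour can be checked without real waiting.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the whole transfer takes longer than the total budget."""


def read_with_deadline(chunks, max_seconds, clock=time.monotonic):
    """Collect byte chunks from an iterable, enforcing a total time budget.

    A no-data timeout never fires against a server that trickles a few
    bytes at a time, so elapsed time is checked each time a chunk arrives.
    """
    deadline = clock() + max_seconds
    body = bytearray()
    for chunk in chunks:
        if clock() > deadline:
            raise DeadlineExceeded(f"transfer exceeded {max_seconds}s in total")
        body.extend(chunk)
    return bytes(body)


if __name__ == "__main__":
    print(read_with_deadline([b"hello ", b"world"], max_seconds=5))
```

The check only runs when a chunk arrives, so it complements a per-read timeout rather than replacing one: the per-read timeout catches a server that goes silent, the deadline catches one that drips 10 bytes/sec forever. With the standard library, the chunks could come from something like `iter(lambda: resp.read(4096), b"")` on a response opened via `urllib.request.urlopen(url, timeout=...)`.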