by smcin 530 days ago
Some basic stuff on curl rate limiting at https://everything.curl.dev/usingcurl/transfers/rate-limitin...

Thank you kindly for the recommendation! It's much appreciated. I look forward to using that along with --rate to lower the request rate itself, since --limit-rate caps the average transfer speed, not necessarily the peak speed.

If you don't mind, I'd like to pick your brain on two other questions. :)

Question 1. Do you know of any sort of commonly used "normal" speed? I have been using --limit-rate 50k. I have the ability to go much faster, but I don't know how fast is too fast. 100k? 500k? 1m? 100g?! 1m is probably too much, but I'm not sure by how much.

I was thinking there might be a way to click around the site with DevTools -> Network and observe how quickly things are moving around, then stay under that threshold, but I don't know if there's a more obvious solution I'm not thinking of.

Question 2. Regarding `robots.txt`, the linked article mentions:

> If a site doesn't specify a crawl-delay in robots.txt, I default to one request every five seconds. If I get 429s, I slow down.

Is the author trying to say: "If `robots.txt` DOES specify a `crawl-delay` or `limit-rate` value, curl and wget will AUTOMATICALLY obey that specified value"?

Or, is it simply: "I MANUALLY check foo.bar/robots.txt and MANUALLY configure `crawl-delay` and/or `limit-rate` to the specified value. Otherwise, I set it to 5 (or higher, if I start getting 429'd)"?

I'm guessing the latter, but it'd be sweet if it's the former. It would make sense for an automatic tool to have an automatic configuration.
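
For what it's worth, here is roughly what the manual reading would look like if scripted in Python. This is only a sketch under assumptions: `pick_delay` and `next_delay` are names I made up, the 5-second fallback comes from the quoted article, and the doubling factor and 300-second cap are arbitrary choices of mine. It uses the standard library's `urllib.robotparser`, which only recognizes whole-number `Crawl-delay` values.

```python
from urllib.robotparser import RobotFileParser

# Fallback from the quoted article: one request every five seconds.
DEFAULT_DELAY = 5.0


def pick_delay(robots_txt: str, user_agent: str = "*",
               default: float = DEFAULT_DELAY) -> float:
    """Return the Crawl-delay for user_agent, or the default if none is set."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def next_delay(current: float, status: int,
               factor: float = 2.0, cap: float = 300.0) -> float:
    """Slow down after a 429; otherwise keep the current delay.

    The factor and cap are assumptions, not values from the article.
    """
    if status == 429:
        return min(current * factor, cap)
    return current


if __name__ == "__main__":
    sample = "User-agent: *\nCrawl-delay: 10\n"
    delay = pick_delay(sample)
    print("seconds between requests:", delay)
    print("after a 429:", next_delay(delay, 429))
```

The resulting delay would then still have to be applied by hand, for example by sleeping between requests or by converting it into a requests-per-minute figure for --rate.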