Thank you kindly for the recommendation! It's much appreciated. I look forward to using that along with --rate to throttle the number of requests themselves, since --limit-rate caps the average transfer speed, not necessarily the peak speed.
If you don't mind, I'd like to pick your brain on two other questions. :)
Question 1. Do you know of any sort of commonly used "normal" speed? I have been using --limit-rate 50k. I have the ability to go much faster, but I don't know how fast is too fast. 100k? 500k? 1m? 100g?! 1m is probably too much, but I'm not sure by how much.
I was thinking there might be a way to click around the site with DevTools -> Network and observe how quickly things are moving around, then stay under that threshold, but I don't know if there's a more obvious solution I'm not thinking of.
Question 2. Regarding `robots.txt`, the linked article mentions:
> If a site doesn't specify a crawl-delay in robots.txt, I default to one request every five seconds. If I get 429s, I slow down.
Is the author trying to say: "If `robots.txt` DOES specify a `crawl-delay` or `limit-rate` value, curl and wget will AUTOMATICALLY obey that specified value"?
Or, is it simply: "I MANUALLY check foo.bar/robots.txt and MANUALLY configure `crawl-delay` and/or `limit-rate` to the specified value. Otherwise, I set it to 5 (or higher, if I start getting 429'd)"?
I'm guessing the latter, but it'd be sweet if it's the former. It would make sense for an automatic tool to have an automatic configuration.
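To make the second reading concrete, here is a rough sketch of what I imagine the manual approach would look like if scripted. It uses Python's standard `urllib.robotparser` to read a `Crawl-delay` value and falls back to the article's five seconds when none is given. The function name `pick_delay`, the constant `DEFAULT_DELAY_SECONDS`, and the sample `robots.txt` text are all made up for illustration; this says nothing about what curl or wget do on their own.

```python
from urllib.robotparser import RobotFileParser

# Fallback taken from the quoted article: one request every five seconds.
DEFAULT_DELAY_SECONDS = 5.0


def pick_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Return the Crawl-delay for user_agent, or the default if none is set.

    robots_txt is the already-downloaded text of a robots.txt file
    (hypothetical input; fetching it is left out to keep this offline).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


# Made-up robots.txt contents, one with and one without a Crawl-delay.
with_delay = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
without_delay = "User-agent: *\nDisallow: /private/\n"

print(pick_delay(with_delay))
print(pick_delay(without_delay))
```

The resulting number would then be used as the pause between requests, and raised by hand if 429s start showing up.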