by registeredcorn 534 days ago
I'm just beginning to learn about curl and wget. Can anyone recommend similar resources to this one that emphasize politeness?

For example, I'd like to grab quite a few books from archive.org, and I want to use their torrent option when it's available. I don't like the idea of "slamming" their site just because I'm trying to grab 400 books at once.

1 comment

Some basic stuff on curl rate limiting at https://everything.curl.dev/usingcurl/transfers/rate-limitin...
Thank you kindly for the recommendation! It's much appreciated. I look forward to using that along with --rate to limit the number of requests themselves, since --limit-rate only caps the average transfer speed, not necessarily the peak speed.

If you don't mind, I'd like to pick your brain for two other questions. :)

Question 1. Do you know of any sort of commonly used "normal" speed? I have been using --limit-rate 50k. I have the ability to go much faster, but I don't know how fast is too fast. 100k? 500k? 1m? 100g?! 1m is probably too much, but I'm not sure by how much.

I was thinking there might be a way to click around the site with DevTools -> Network and observe how quickly things are moving around, then stay under that threshold, but I don't know if there's a more obvious solution I'm not thinking of.

Question 2. Regarding `robots.txt`, the linked article mentions:

> If a site doesn't specify a crawl-delay in robots.txt, I default to one request every five seconds. If I get 429s, I slow down.

Is the author trying to say: "If `robots.txt` DOES specify a `crawl-delay` or `limit-rate` value, curl and wget will AUTOMATICALLY obey that specified value"?

Or, is it simply: "I MANUALLY check foo.bar/robots.txt and MANUALLY configure `crawl-delay` and/or `limit-rate` to the specified value. Otherwise, I set it to 5 (or higher, if I start getting 429'd)"?

I'm guessing the latter, but it'd be sweet if it's the former. It would make sense for an automatic tool to have an automatic configuration.
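
For what it's worth, here is a minimal Python sketch of the "manual" reading of that quote, under the assumption that the delay is picked by hand or by a small helper script and not by curl itself (as far as I know, curl does not read robots.txt at all). It uses the standard library's `urllib.robotparser` to look up a `Crawl-delay`, falls back to the five seconds from the quote, and doubles the delay after a 429. The function names, the doubling factor, and the 300-second cap are hypothetical choices for illustration, not anything the linked article specifies.

```python
import urllib.robotparser

# Fallback taken from the quoted article: one request every five seconds.
DEFAULT_DELAY = 5.0


def crawl_delay_from_robots(robots_txt: str, user_agent: str = "*") -> float:
    """Return the Crawl-delay for user_agent, or DEFAULT_DELAY if none is set.

    robots_txt is the text of a robots.txt file that has already been
    downloaded (hypothetical helper, not part of curl or wget).
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


def next_delay(current: float, status: int,
               factor: float = 2.0, cap: float = 300.0) -> float:
    """Slow down after an HTTP 429; otherwise keep the current delay.

    The factor and cap are assumed values, chosen only for this sketch.
    """
    if status == 429:
        return min(current * factor, cap)
    return current
```

The resulting number of seconds would then be used as the pause between requests, for example by sleeping in a loop around each download. Note that this only spaces out requests; it says nothing about transfer speed, which is what --limit-rate controls.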