by renegat0x0
744 days ago
My 5 cents:
- Status codes 200-299 are all OK.
- Status codes 300-399 are redirects, and can also end up OK once followed.
- In my experience 403 occurs quite often where it is not really an error, but a hint that your user agent is not accepted.
- robots.txt should be scanned to check whether any resource is prohibited, or whether there are speed requirements. It is always better to be _nice_. I plan to add something like that; it is also missing in my project.
- It would be interesting to generate a hash from the app and update only if the hash is different?
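A rough sketch of the points above, in Python and using only the standard library. None of this comes from either project: the function names, the category labels, and the reading of "hash" as a SHA-256 of the response body are all assumptions made for illustration.

```python
import hashlib
from typing import Optional, Tuple
from urllib import robotparser


def classify_status(code: int) -> str:
    """Bucket an HTTP status code (labels are hypothetical)."""
    if 200 <= code <= 299:
        return "ok"
    if 300 <= code <= 399:
        return "redirect"  # may still end up OK once followed
    if code == 403:
        return "blocked"   # often means the user agent was rejected
    return "error"


def build_robots(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def content_changed(body: bytes, previous_hash: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, new_hash); assumes a SHA-256 of the body is the hash."""
    digest = hashlib.sha256(body).hexdigest()
    return digest != previous_hash, digest


# Example robots.txt (made up for this sketch).
ROBOTS = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_robots(ROBOTS)
allowed = robots.can_fetch("mybot", "https://example.com/index.html")
delay = robots.crawl_delay("mybot")  # seconds to wait between requests, or None
```

Here `can_fetch` covers the "prohibited resource" part and `crawl_delay` the "speed requirements" part; whether a 403 deserves a retry with a different user agent is a policy decision the sketch leaves open.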
I thought about robots.txt, but since this is software that you are supposed to run against your own website, I didn't consider it worthwhile. You have a point about speed requirements and prohibited resources (though skipping over them won't add any security).
I haven't put much time or effort into an update step. Currently it resumes from checkpoints if the process exited: it saves the current state every 250 URLs, so if any URLs are missing it can continue, otherwise it is done.
Thanks, btw what's your project!? Share!
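For anyone curious, the checkpoint-and-resume idea can be sketched roughly like this in Python. This is not the actual implementation: the JSON state file, the function names, and the `fetch` callback are assumptions for illustration; only the "save every 250 URLs" interval comes from the description above.

```python
import json
import os
from typing import Callable, Iterable, List, Set

CHECKPOINT_EVERY = 250  # interval mentioned above


def load_done(path: str) -> Set[str]:
    """Load the set of already-processed URLs, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_done(path: str, done: Set[str]) -> None:
    """Write the state to a temp file, then swap it in atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def crawl(urls: Iterable[str], path: str, fetch: Callable[[str], None],
          every: int = CHECKPOINT_EVERY) -> List[str]:
    """Process URLs not yet in the checkpoint; return the ones handled this run."""
    done = load_done(path)
    processed: List[str] = []
    for url in urls:
        if url in done:
            continue  # already handled in an earlier run
        fetch(url)
        done.add(url)
        processed.append(url)
        if len(processed) % every == 0:
            save_done(path, done)  # periodic checkpoint
    save_done(path, done)  # final state once everything is done
    return processed
```

If the process dies between checkpoints, at most `every - 1` URLs get fetched again on the next run, which seems an acceptable trade-off for a tool run against your own site.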