Hacker News
by renegat0x0 744 days ago
My 5 cents:

- status codes 200-299 are all OK

- status codes 300-399 are redirects, which can also turn out to be OK once followed

- 403, in my experience, occurs quite often where it is not a real error but a hint that the server does not accept your user agent

- robots.txt should be scanned to check whether any resource is prohibited, or whether there are speed requirements. It is always better to be _nice_. This is still missing from my own project too; I plan to add something like it

- It would be interesting to generate a hash of the app and update only if the hash is different
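A minimal sketch of the hash idea from the last point, assuming the hash is taken over the raw bytes of each fetched resource. The names `content_hash`, `needs_update` and `known_hashes` are made up for illustration and do not come from either project.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Return the hex SHA-256 digest of a fetched response body."""
    return hashlib.sha256(body).hexdigest()


def needs_update(url: str, body: bytes, known_hashes: dict[str, str]) -> bool:
    """Return True if the body differs from what was stored for this URL.

    `known_hashes` is a hypothetical in-memory store (URL -> digest);
    a real tool would probably persist it between runs.
    """
    new_hash = content_hash(body)
    if known_hashes.get(url) == new_hash:
        return False  # unchanged since the last run, nothing to do
    known_hashes[url] = new_hash  # remember the new version
    return True
```

Hashing raw bytes will flag a page as changed even if only a timestamp or nonce differs, so some normalisation might be needed in practice.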


On status codes: I display the list because, on a JavaScript-driven application, you mostly don't want codes other than 200 (besides media).
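To make the status code discussion concrete, here is a rough sketch of how a crawler could bucket responses along the lines of the parent comment and then list the ones worth a warning. The labels and the `classify_status` / `warnings_for` helpers are hypothetical, not what either tool actually does.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse, made-up label."""
    if 200 <= code <= 299:
        return "ok"
    if 300 <= code <= 399:
        return "redirect"  # may still end up OK once followed
    if code == 403:
        return "forbidden"  # often a rejected user agent, not a broken page
    if code == 404:
        return "broken"
    return "error"


def warnings_for(results: dict[str, int]) -> list[str]:
    """Return one warning line per URL whose final status is not 2xx."""
    return [
        f"WARNING {code} ({classify_status(code)}): {url}"
        for url, code in results.items()
        if classify_status(code) != "ok"
    ]
```

For example, `warnings_for({"/a": 200, "/b": 404})` would only report `/b`.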

I thought about robots.txt, but as this is software that you are supposed to run against your own website, I didn't consider it worthwhile. You have a point on speed requirements and prohibited resources (but it's not like skipping over them will add any security).
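For reference, checking prohibited resources and speed requirements does not need much code in Python: the standard library ships `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so it runs offline; the agent name `example-crawler` and the rules themselves are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "example-crawler"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Prohibited resources: ask before fetching each URL.
print(rules.can_fetch(AGENT, "https://example.com/private/page"))
print(rules.can_fetch(AGENT, "https://example.com/index.html"))

# Speed requirements: seconds to wait between requests, or None if unset.
print(rules.crawl_delay(AGENT))
```

Against a live site, `set_url()` followed by `read()` on the same parser would fetch the real file instead of the inline text.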

I haven't put much time/effort into an update step. Currently, it resumes from checkpoints if the process exited (it saves the current state every 250 URLs; if any URL is still missing it can continue, otherwise it is done).
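The checkpoint behaviour described above could look roughly like this. It is only a sketch under assumptions: the state is a JSON list of finished URLs, and `fetch` is whatever callable does the real work. None of the names reflect the actual implementation.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 250  # interval mentioned in the comment above


def load_checkpoint(path: Path) -> set[str]:
    """Return the URLs finished in an earlier run (empty set if none)."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def save_checkpoint(path: Path, done: set[str]) -> None:
    """Write the state to a temp file first so a crash cannot corrupt it."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    tmp.replace(path)


def crawl(urls, fetch, path: Path, every: int = CHECKPOINT_EVERY) -> int:
    """Process the URLs not yet in the checkpoint; return how many were done."""
    done = load_checkpoint(path)
    pending = [u for u in urls if u not in done]
    for count, url in enumerate(pending, start=1):
        fetch(url)  # hypothetical callable doing the real work
        done.add(url)
        if count % every == 0:
            save_checkpoint(path, done)
    save_checkpoint(path, done)  # final state, even for a partial batch
    return len(pending)
```

If the process dies between checkpoints, at most `every - 1` URLs are fetched again on the next run.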

Thanks, btw what's your project!? Share!

I agree with your points.

You might be interested in the Reddit webscraping community: https://www.reddit.com/r/webscraping/

My passion project is https://github.com/rumca-js/Django-link-archive

Currently I use only one thread for scraping; I do not need more. It gets the job done. Also, I know too little to play more with Python "celery" threads.

My project can be used for various things, depending on your needs. Recently I have been playing with using it as a 'search engine'. I am scraping the Internet to find cool stuff. Results are in https://github.com/rumca-js/Internet-Places-Database. Not all domains are interesting though.

> On status codes: I display the list because, on a JavaScript-driven application, you mostly don't want codes other than 200 (besides media).

What? Why? Regardless of the programming language used to generate content, the standard, well-known HTTP status codes should be returned as expected. If your JS-served site gives me a 200 code when it should be a 404, you're wrong.

I think you are misunderstanding. Your application is expected to give mostly 200 codes; if you get a 404, then a link is broken or a page is misbehaving, which is exactly why that page URL is displayed on the console with a warning.

In many cases, 403 is really 404 on things like S3.