by forlorn 2062 days ago
Web scraping only sounds simple, but once you add multithreading, queues, workers and rate limiting it becomes a real monster.
2 comments

Let me add some more specific sources of complexity: the depth of a site's pages, the length of the parameters in dynamically generated links (potentially infinite if the website's code has a circular mechanism that keeps appending to them), upper/lowercase characters in links (irrelevant for the protocol and domain, but relevant for the rest, such as the path and parameters), and so on.
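To make those three points concrete, here is a minimal Python sketch of how a crawler might guard against them. The limits `MAX_DEPTH` and `MAX_QUERY_LEN` and the helper names `normalize` and `should_visit` are made up for illustration, not taken from any particular tool, and the right values depend on the site.

```python
from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 5        # hypothetical cap on link depth from the start page
MAX_QUERY_LEN = 200  # hypothetical cap to cut off ever-growing parameters


def normalize(url):
    """Lowercase only the case-insensitive parts of a URL.

    Scheme and host are case-insensitive; path and query are kept as-is.
    Assumes the URL carries no user:password part, which would also be
    lowercased here.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the root
        parts.query,
        "",                  # drop the fragment, it never reaches the server
    ))


def should_visit(url, depth, seen):
    """Return True if the URL passes the depth, length and duplicate checks."""
    if depth > MAX_DEPTH:
        return False
    if len(urlsplit(url).query) > MAX_QUERY_LEN:
        return False
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this, `http://Example.com/a` and `http://example.com/a` count as the same page, while `/a` and `/A` stay distinct. A length cap is a blunt way to stop a circular "adding" mechanism; it will not catch loops that keep the query short.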

I've just started on this topic and I'm having a lot of unexpected "fun" :)

Do you know of any tool that deals with these? I've been using the offline version of apify; its multithreading, queues and workers seem to be good, but it does not seem to do rate limiting.
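If the tool really has no rate limiting, one can be bolted on in front of the workers. Below is a rough Python sketch of a per-host limiter that enforces a minimum gap between requests and is safe to share between threads. The class name `RateLimiter` and its parameters are hypothetical, and this is only one simple approach (a fixed interval per host), not how any specific crawler does it.

```python
import threading
import time


class RateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests per host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = {}           # host -> earliest allowed start time

    def wait(self, host):
        """Block until a request to `host` is allowed; return the delay used."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot before releasing the lock so that concurrent
            # workers queue up behind each other instead of firing together.
            self._next_allowed[host] = start + self.min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock
        return delay
```

Each worker would call something like `limiter.wait(urlsplit(url).hostname)` just before fetching. Different hosts do not slow each other down, and sleeping outside the lock keeps one slow host from stalling workers that are fetching from another.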