| This guide (like most others) is missing a massive tip: separate the crawling step (finding URLs and fetching the HTML content) from the scraping step (extracting structured data out of the HTML). More than once, I wrote a scraper that did both steps together. Only later did I realize that I had forgotten to extract some information I needed, and I had to do the costly task of re-crawling and re-scraping everything. If you do this in two steps and keep the raw HTML, you can always go back, change the scraper, and quickly rerun it on the historical data instead of re-crawling everything from scratch. |
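
To make the two-step idea concrete, here is a minimal sketch using only the Python standard library. All names in it (`crawl`, `scrape`, `fetch_html`, `extract_title`, the one-JSON-file-per-URL layout) are my own assumptions for illustration, not something from the guide. A real crawler would also need link discovery, rate limiting, and error handling.

```python
import hashlib
import json
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Download one page. Hypothetical default; swap in your own fetcher."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, store_dir, fetch=fetch_html):
    """Step 1: fetch each URL once and save the raw HTML to disk."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    for url in urls:
        # Hash the URL to get a safe, stable file name (an assumed convention).
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        path = store / f"{key}.json"
        if path.exists():  # already crawled, skip the costly fetch
            continue
        record = {"url": url, "html": fetch(url)}
        path.write_text(json.dumps(record), encoding="utf-8")


class TitleParser(HTMLParser):
    """Example extractor: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html: str) -> dict:
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def scrape(store_dir, extract=extract_title):
    """Step 2: run an extractor over the saved pages. No network needed."""
    results = []
    for path in sorted(Path(store_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        results.append({"url": record["url"], **extract(record["html"])})
    return results
```

If you later realize you need another field, you only write a new extractor and call `scrape(store_dir, extract=new_extractor)` again; `crawl` never has to touch the network for pages it has already saved.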