by funnyflamigo
1663 days ago
Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages? Let's say you're scraping product info from a large list of products. I'm assuming you mean if it's strange one-off type errors to handle those, and you'd stop altogether if too many fail? Otherwise you'd just be DOS'ing the site.
Sure! It's easier for me to get concrete about this project than to comment on your hypothetical about a large list of products, though, so forgive me in advance for pivoting on the scenario here.
I'm scraping job pages. Typically, one job posting == one link. I can follow the link for a job posting and extract data from specific HTML elements using CSS selectors or XPath expressions. However, sometimes the data I'm looking for isn't structured the way I expect. The main place I notice variation in job ad data is the location. There are a zillion little variations in how you can structure the location of a job ad. City+country, city+state+country, comma separated, space separated, localized states, no states or provinces, and all the permutations thereof.
I've written the extractor to expect a certain format of location data for a given job site - let's say "<city>, <country>", for example. If the scraper comes across an entry that happens to be "<city>, <state>, <country>", it's generally not smart enough to generalize its transform logic to deal with that. So, to handle it, I mark that particular job page link as needing human review: it pops up as an ERROR in my logs, and its database entry gets post_status == 5. The posting still gets inserted into the database, but it isn't posted live on the site.
That way, I can go in and manually fix the posting, approve it to go on the site (if it's relevant), and, ideally, tweak the scraper logic so that it handles transforms of that style of data formatting as well as the "<city>, <country>" format I originally expected.
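Here's a minimal Python sketch of that flow, in case it helps make it concrete. It's an illustration rather than the actual scraper code: the `JobPosting` class, the `parse_location` and `process` helpers, the `STATUS_OK` value, and the example URLs are all hypothetical names made up for this example. Only the `post_status == 5` convention and the ERROR log entry come from the description above.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Tuple

logger = logging.getLogger("scraper")

STATUS_OK = 1            # hypothetical "ready to post" value (an assumption)
STATUS_NEEDS_REVIEW = 5  # the post_status == 5 convention described above


@dataclass
class JobPosting:
    url: str
    raw_location: str
    city: Optional[str] = None
    country: Optional[str] = None
    post_status: int = STATUS_OK


def parse_location(raw: str) -> Optional[Tuple[str, str]]:
    """Return (city, country) for "<city>, <country>"; None for any other shape."""
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    return None


def process(url: str, raw_location: str) -> JobPosting:
    """Build a posting; flag it for human review instead of aborting the scrape."""
    posting = JobPosting(url=url, raw_location=raw_location)
    parsed = parse_location(raw_location)
    if parsed is None:
        # Unexpected format: log an ERROR and mark the entry for review.
        logger.error("Unexpected location format %r at %s", raw_location, url)
        posting.post_status = STATUS_NEEDS_REVIEW
    else:
        posting.city, posting.country = parsed
    # Either way the caller inserts the posting into the database;
    # only entries that are not flagged would go live.
    return posting
```

With this sketch, `process("https://example.com/jobs/1", "Berlin, Germany")` comes back with the city and country filled in, while `process("https://example.com/jobs/2", "Austin, Texas, USA")` comes back with `post_status` set to 5 and the raw location kept for manual fixing. The scrape itself keeps going in both cases.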
Does that make sense?
I suspect I'm just writing logic to deal with the malformed/irregular entries that humans put into job sites XD