by funnyflamigo 1663 days ago
Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Let's say you're scraping product info from a large list of products. I'm assuming you mean if it's strange one-off type errors to handle those, and you'd stop altogether if too many fail? Otherwise you'd just be DOS'ing the site.


Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Sure! I can get a little more concrete about this project more easily than I can comment on your hypothetical about a large list of products, though, so forgive me in advance for pivoting on the scenario here.

I'm scraping job pages. Typically, one job posting == one link. I can follow the link for a posting and extract data from specific HTML elements using CSS selectors or XPath expressions. However, sometimes the data I'm looking for isn't structured the way I expect. The main place I see variation in job ad data is location. There are a zillion little variations in how you can structure the location of a job ad: city+country, city+state+country, comma separated, space separated, localized states, no states or provinces, and all the permutations thereof.

I've written the extractor to expect a certain format of location data for a given job site - let's say "<city>, <country>", for example. If the scraper comes across an entry that happens to be "<city>, <state>, <country>", it's generally not smart enough to generalize its transform logic to deal with that. So, to handle it, I mark that particular job page link as needing human review, so it pops up as an ERROR in my logs, and as an entry in the database that has post_status == 5. After that, it gets inserted into the database, but not posted live onto the site.

That way, I can go in and manually fix the posting, approve it to go on the site (if it's relevant), and, ideally, tweak the scraper logic so that it handles transforms of that style of data formatting as well as the "<city>, <country>" format I originally expected.
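Here's a minimal sketch of the flagging flow described above, not the actual scraper code. The names (`Posting`, `parse_location`, `ACTIVE`) and the value of `ACTIVE` are made up for illustration; only `post_status == 5` and the "<city>, <country>" format come from the description.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("scraper")

NEEDS_REVIEW = 5  # the post_status value mentioned above
ACTIVE = 1        # hypothetical code for a posting that parsed cleanly


@dataclass
class Posting:
    url: str
    raw_location: str
    city: Optional[str] = None
    country: Optional[str] = None
    post_status: int = ACTIVE


def parse_location(posting: Posting) -> Posting:
    """Parse "<city>, <country>"; flag anything else instead of raising."""
    parts = [p.strip() for p in posting.raw_location.split(",")]
    if len(parts) == 2 and all(parts):
        posting.city, posting.country = parts
    else:
        # Unexpected shape (e.g. "<city>, <state>, <country>"): log it and
        # mark the row for human review, but let the scrape continue.
        log.error("unexpected location %r at %s",
                  posting.raw_location, posting.url)
        posting.post_status = NEEDS_REVIEW
    return posting
```

The point is that a bad entry changes a status field rather than throwing, so the rest of the run is unaffected and the odd rows can be found later by filtering on the status.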

Does that make sense?

I suspect I'm just writing logic to deal with malformed/irregular entries that humans make into job sites XD

I've had a lot of success just saving the data into gzipped tarballs, like a few thousand documents per tarball. That way I can replay the data and tweak the algorithms without causing traffic.
Is that still practical even if you're storing the page text?

The reason I don't do that is because I have a few functions that analyze the job descriptions for relevance, but don't store the post text. I mostly did that to save space - I'm just aggregating links to relevant roles, not hosting job posts.

I figured saving ~1000 job descriptions would take up a needlessly large chunk of space, but truth be told I never did the math to check.

Edit: I understand scrapy does something similar to what you're describing; have considered using that as my scraper frontend but haven't gotten around to doing the work for it yet.

Yeah, sure. The text itself is usually at most a few hundred KB, and HTML compresses extremely well. It's pretty slow to unpack and replay the documents, but it's still a lot faster than downloading them again.
And it's friendlier to the server you're getting the data from.
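For anyone who wants to try the tarball approach, a rough sketch using Python's standard `tarfile` module might look like this. The function names and file naming are assumptions for the example, not how the setup above is actually implemented.

```python
import io
import tarfile
from typing import Dict, Iterator, Tuple


def write_batch(path: str, documents: Dict[str, bytes]) -> None:
    """Write one batch of fetched documents into a gzipped tarball."""
    with tarfile.open(path, "w:gz") as tar:
        for name, body in documents.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))


def replay_batch(path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, body) pairs from a saved batch, in archive order."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()
```

Replaying is then just a loop over `replay_batch(...)` feeding the parser, so extraction logic can be tweaked without touching the network.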

As a journalist, I have to scrape government sites now and then for datasets they won't hand over via FOIA requests ("It's on our site, that's the bare minimum to comply with the law so we're not going to give you the actual database we store this information in.") They're notoriously slow and often will block any type of systematic scraping. Better to get whatever you can and save it, then run your parsing and analysis on that instead of hoping you can get it from the website again.

First of all, thanks for marginalia.nu.

Have you considered storing compressed blobs in a sqlite file? Works fine for me, you can do indexed searches on your "stored" data, and can extract single pages if you want.
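A minimal sketch of the compressed-blobs-in-sqlite idea, using only the standard `sqlite3` and `zlib` modules. The table layout and function names are assumptions for illustration.

```python
import sqlite3
import zlib
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the page store; the URL is the indexed key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, html: str) -> None:
    """Compress the page text and store it as a blob."""
    blob = zlib.compress(html.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        (url, blob),
    )
    conn.commit()


def load_page(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Fetch and decompress a single page, or None if it isn't stored."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Because `url` is the primary key, pulling out a single page is an indexed lookup rather than a scan through an archive.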

The main reason I'm doing it this way is because I'm saving this stuff to a mechanical drive, and I want consistent write performance and low memory overhead. Since it's essentially just an archive copy, I don't mind if it takes half an hour to chew through looking for some particular set of files. Since this is a format designed for tape drives, it causes very little random access. It's important that writes are relatively consistent, since my crawler writes these archives while it's crawling, and it can reach speeds of 50-100 documents per second, which would be extremely rough on any sort of database based on a single mechanical hard drive.

These archives are just an intermediate stage that's used if I need to reconstruct the index to tweak say keyword extraction or something, so random access performance isn't something that is particularly useful.

Have you thought about pushing the links onto a queue and running multiple scrapers off that queue? You'd need to build in some politeness mechanism to make sure you're not hitting the same domain/ip address too often but it seems like a better option than a serial process.
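As a rough sketch of what such a politeness mechanism could look like: a shared queue that only hands out a URL if its host hasn't been hit within some minimum delay. The class name, the 5-second default, and the injectable clock are all assumptions for the example; a real multi-worker setup would also need a lock (or a proper message queue) around `pop_ready`.

```python
import time
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse


class PoliteQueue:
    """URL queue that enforces a minimum delay between hits per host."""

    def __init__(self, min_delay: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_delay = min_delay
        self.clock = clock
        self.pending: List[str] = []
        self.next_allowed: Dict[str, float] = {}

    def push(self, url: str) -> None:
        self.pending.append(url)

    def pop_ready(self) -> Optional[str]:
        """Return the oldest URL whose host may be fetched now, else None."""
        now = self.clock()
        for i, url in enumerate(self.pending):
            host = urlparse(url).netloc
            if self.next_allowed.get(host, float("-inf")) <= now:
                # Reserve the host for min_delay seconds, then hand out the URL.
                self.next_allowed[host] = now + self.min_delay
                return self.pending.pop(i)
        return None
```

Workers would call `pop_ready()` in a loop and sleep briefly when it returns `None`, which keeps several hosts busy in parallel without hammering any single one.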
Why 5, exactly? This struck me as odd in the article. Perhaps I missed something. Are there other statuses? Why are statuses numeric?
It's arbitrary.

I have a field, post_status, in my backend database, that I use to categorize posts. Each category is a numeric code so SQL can filter it relatively quickly. I have statuses for active posts, dead posts, ignored links, links needing review, etc.

It's a way for me to sort through my scraper's results quickly.
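To make the numeric-status approach concrete, here's a small sketch with `sqlite3`. Only `post_status` and the value 5 for "needs review" come from the description above; the other codes, the `posts` table, and the function names are hypothetical.

```python
import sqlite3
from enum import IntEnum
from typing import List


class PostStatus(IntEnum):
    ACTIVE = 1        # hypothetical value
    DEAD = 2          # hypothetical value
    IGNORED = 3       # hypothetical value
    NEEDS_REVIEW = 5  # the value mentioned in the thread


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (url TEXT, post_status INTEGER)")
    return conn


def add_post(conn: sqlite3.Connection, url: str, status: PostStatus) -> None:
    # Store the plain integer; the enum only exists on the Python side.
    conn.execute("INSERT INTO posts VALUES (?, ?)", (url, int(status)))


def needing_review(conn: sqlite3.Connection) -> List[str]:
    """Return the URLs of every post flagged for human review."""
    rows = conn.execute(
        "SELECT url FROM posts WHERE post_status = ? ORDER BY url",
        (int(PostStatus.NEEDS_REVIEW),),
    )
    return [url for (url,) in rows]
```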

I think you have a case of premature optimisation there, as I wrote in a recent comment[0].

[0]: https://news.ycombinator.com/item?id=29430281

Not sure what's premature here. The optimization is to allow me, a human, to find a certain class of database records quickly. I also chose a method that I understand to be snappy on the SQL side as well.

What would you suggest as a non-optimized alternative? That might make your point about premature optimization clearer.

There is indeed a trade-off, and the direction I would have chosen is to use meaningful status names as opposed to magic numbers. My reasoning is that keeping the system self-explanatory lowers maintenance cost, which makes more economic sense to me than obscuring the meaning behind some of the code/data for a practically non-existent performance benefit.

After all, hardware is cheap, but developer time isn't.

For a more concrete example, I might have chosen the value `'pending'` (or similar) instead of `5`. Active listings might have status `'active'`. Expired ones might have status `'expired'`, etc.

Integer columns are significantly faster and smaller than strings in a SQL database. It adds up quickly if you have a sufficiently large database.

I use the following scheme:

   1 - exhausted
   0 - alive
  -1 - blocked (by my rules)
  -2 - redirected
  -3 - error
The author is scraping fewer than 1,000 records per day, or roughly 365,000 records per year.

On my own little SaaS project, the difference between querying an integer and a varchar like “active” is imperceptible, and that’s in a table with 7,000,000 rows.

It would take the author 19 years to run into the scale that I’m running at, where this optimisation is meaningless. And that’s assuming they don’t periodically clean their database of stale data, which they should.

So this looks like a premature optimisation to me, which is why it stood out as odd to me in the article.

I'd put it closer to the category of best practices than premature optimization. It's pretty much always a good idea. It's not that skipping this will break things; the alternative is just slower and uses more resources in a way that affects all queries, since larger datatypes lengthen the records, and locality is tremendously important in all aspects of software performance.
I disagree. I think a better "best practice" is to make the meaning behind the code as clear as possible. In this case, the code/data is less clear, and there is zero performance benefit.