by marginalia_nu 1666 days ago
I've had a lot of success just saving the data into gzipped tarballs, like a few thousand documents per tarball. That way I can replay the data and tweak the algorithms without causing traffic.
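For anyone who wants to try the tarball approach, here is a minimal sketch using Python's standard `tarfile` module. The function names `write_batch` and `replay_batch` and the file layout are made up for illustration; this is not the actual crawler code.

```python
import io
import tarfile


def write_batch(path, documents):
    """Write a {name: html_text} mapping into one gzipped tarball."""
    with tarfile.open(path, "w:gz") as tar:
        for name, text in documents.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the size must be set before adding
            tar.addfile(info, io.BytesIO(data))


def replay_batch(path):
    """Yield (name, html_text) for every stored document, in archive order."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read().decode("utf-8")
```

Replaying is then just a loop over `replay_batch("batch-0001.tar.gz")` that feeds each document to whatever parsing or ranking code is being tweaked, with no network traffic involved.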

Is that still practical even if you're storing the page text?

The reason I don't do that is because I have a few functions that analyze the job descriptions for relevance, but don't store the post text. I mostly did that to save space - I'm just aggregating links to relevant roles, not hosting job posts.

I figured saving ~1000 job descriptions would take up a needlessly large chunk of space, but truth be told I never did the math to check.

Edit: I understand scrapy does something similar to what you're describing; have considered using that as my scraper frontend but haven't gotten around to doing the work for it yet.
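The space question is easy to check empirically. A rough sketch, assuming you already have a sample of pages as strings; `estimate_sizes` is a hypothetical helper, and compressing the concatenated pages only approximates what a gzipped tarball would come to:

```python
import gzip


def estimate_sizes(pages):
    """Return (raw_bytes, gzipped_bytes) for a list of page strings."""
    raw = sum(len(page.encode("utf-8")) for page in pages)
    # Compress everything as one stream, roughly like a .tar.gz would.
    compressed = len(gzip.compress("".join(pages).encode("utf-8")))
    return raw, compressed
```

Running it on a few dozen real job descriptions and scaling the result up to ~1000 gives a ballpark figure. The actual ratio depends heavily on how much markup the pages share.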

Yeah, sure. The text itself is usually at most a few hundred KB, and HTML compresses extremely well. It's pretty slow to unpack and replay the documents, but it's still a lot faster than downloading them again.
And it's friendlier to the server you're getting the data from.

As a journalist, I have to scrape government sites now and then for datasets they won't hand over via FOIA requests ("It's on our site, that's the bare minimum to comply with the law so we're not going to give you the actual database we store this information in.") They're notoriously slow and often will block any type of systematic scraping. Better to get whatever you can and save it, then run your parsing and analysis on that instead of hoping you can get it from the website again.

First of all, thanks for marginalia.nu.

Have you considered storing compressed blobs in a sqlite file? Works fine for me, you can do indexed searches on your "stored" data, and can extract single pages if you want.
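A minimal sketch of what that can look like with Python's built-in `sqlite3` and `zlib` modules. The `pages` table, its columns, and the function names are assumptions for illustration, not a description of any particular setup:

```python
import sqlite3
import zlib


def open_store(path):
    """Open (or create) a sqlite file holding one compressed blob per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def put_page(conn, url, html):
    """Compress the page and store it, replacing any earlier copy."""
    blob = zlib.compress(html.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, blob)
    )
    conn.commit()


def get_page(conn, url):
    """Return the decompressed page text, or None if the URL is not stored."""
    row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

The primary key on `url` is what makes single-page extraction an indexed lookup. Each page is compressed separately here, so the ratio will likely be worse than compressing thousands of similar documents in one stream.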

The main reason I'm doing it this way is because I'm saving this stuff to a mechanical drive, and I want consistent write performance and low memory overhead. Since it's essentially just an archive copy, I don't mind if it takes half an hour to chew through looking for some particular set of files. Since this is a format designed for tape drives, it causes very little random access. It's important that it's relatively consistent to write since my crawler writes while it's crawling, and it can reach speeds of 50-100 documents per second, which would be extremely rough on any sort of database based on a single mechanical hard drive.

These archives are just an intermediate stage that's used if I need to reconstruct the index to tweak say keyword extraction or something, so random access performance isn't something that is particularly useful.
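To illustrate the sequential-write pattern described above, here is a rough sketch of an archive writer that appends documents as they arrive and rolls over to a new tarball every N documents. The `RollingArchive` class, its file naming, and the default batch size are all hypothetical. It uses `tarfile`'s stream mode (`"w|gz"`), which only ever writes forward and never seeks.

```python
import io
import tarfile


class RollingArchive:
    """Append documents to gzipped tarballs, starting a new file every N docs."""

    def __init__(self, prefix, docs_per_file=1000):
        self.prefix = prefix
        self.docs_per_file = docs_per_file
        self.file_index = 0
        self.count = 0
        self.tar = None

    def add(self, name, text):
        if self.tar is None:
            path = f"{self.prefix}-{self.file_index:05d}.tar.gz"
            # "w|gz" is a forward-only stream: no seeking on the output file.
            self.tar = tarfile.open(path, "w|gz")
        data = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        self.tar.addfile(info, io.BytesIO(data))
        self.count += 1
        if self.count >= self.docs_per_file:
            self.close()

    def close(self):
        """Finish the current tarball, if any, so the next add starts a new one."""
        if self.tar is not None:
            self.tar.close()
            self.tar = None
            self.file_index += 1
            self.count = 0
```

A crawler would call `add()` once per fetched document and `close()` on shutdown. Memory use stays flat because only the document currently being written is held in memory.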