by placidpanda 1312 days ago
When doing this in the past, I settled on an SQLite database with one table that stores the compressed HTML (gzip or lzma) along with other columns (id/date/url/domain/status/etc.).

This also made it easy to alert when something broke (query the table for count(*) where status = 'error') and to rerun the parser on the failures.
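A minimal sketch of that single-table setup, assuming Python's standard sqlite3 and gzip modules. The table name `pages`, the column names, and the helper functions are hypothetical, picked only to mirror the columns listed above; this is not the original poster's actual schema.

```python
import gzip
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the page store. All names here are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id         INTEGER PRIMARY KEY,
               fetched_at TEXT NOT NULL,
               url        TEXT NOT NULL,
               domain     TEXT NOT NULL,
               status     TEXT NOT NULL,  -- e.g. 'ok' or 'error'
               html_gz    BLOB            -- gzip-compressed HTML, NULL if none
           )"""
    )
    return conn


def save_page(conn, url, domain, status, html):
    """Store one fetch result; html may be None when the fetch failed."""
    blob = gzip.compress(html.encode("utf-8")) if html is not None else None
    conn.execute(
        "INSERT INTO pages (fetched_at, url, domain, status, html_gz) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), url, domain, status, blob),
    )
    conn.commit()


def count_errors(conn):
    """The alerting query: how many rows are marked as errors."""
    return conn.execute(
        "SELECT count(*) FROM pages WHERE status = 'error'"
    ).fetchone()[0]


def failed_urls(conn):
    """URLs to hand back to the fetcher/parser for a rerun."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status = 'error' ORDER BY id"
    )
    return [row[0] for row in rows]


def load_html(conn, url):
    """Return the most recently stored HTML for a URL, or None."""
    row = conn.execute(
        "SELECT html_gz FROM pages WHERE url = ? ORDER BY id DESC LIMIT 1",
        (url,),
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return gzip.decompress(row[0]).decode("utf-8")
```

Swapping gzip for lzma would only change the two compress/decompress calls, since both modules expose `compress()` and `decompress()` on bytes.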

1 comment

Yup. A database gives you all the performance AND flexibility you need. MySQL or PostgreSQL will work well too.

Storing pages as individual files is a no-go because it wastes way too much disk space: each file occupies whole filesystem blocks, so the unused tail of the last block adds up across many small files. And more customized cache tools will never be as flexible or have as much tooling as a widely supported relational database.

For even better compression, also use a preset dictionary tuned to a wide sample of HTML, but it doesn't sound like you need to go that far.
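For anyone curious what the preset dictionary looks like in practice, here is a rough sketch using the `zdict` parameter of Python's zlib module. The dictionary bytes in `HTML_DICT` are a made-up placeholder, not a tuned one; a real dictionary would be built from substrings that recur across a large sample of your own pages. The helper names are hypothetical too.

```python
import zlib

# Placeholder dictionary (an assumption for illustration only): common HTML
# boilerplate that is likely to appear in many pages. zlib favors matches
# near the end of the dictionary, so the most common strings go last.
HTML_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title>'
    b'</head><body><div class=""><a href="https://"></a></div></body></html>'
)


def compress_with_dict(data: bytes, zdict: bytes = HTML_DICT) -> bytes:
    """Deflate data with the compressor primed by a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = HTML_DICT) -> bytes:
    """Inflate data; the exact same dictionary bytes must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()
```

The catch is that the dictionary becomes part of the storage format: every stored blob can only be decompressed with the exact dictionary it was compressed with, so it has to be versioned and kept alongside the database.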