by deusu 3753 days ago

I'm always open to new business opportunities. :)

What would be more useful to you: the raw data (for each page, a list of the keywords on it) or the reverse-word-index (for each keyword, a list of the pages it appears on)?

Raw data may be better for batch processing or for running multiple queries at the same time.
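To make the difference between the two formats concrete, here is a minimal sketch in Python. The page URLs, keywords, and the function name `build_reverse_index` are made up for illustration; this is not the crawler's actual storage format.

```python
from collections import defaultdict

def build_reverse_index(raw_data):
    """Turn {page: [keywords]} into {keyword: sorted list of pages}."""
    index = defaultdict(set)
    for page, keywords in raw_data.items():
        for word in keywords:
            index[word].add(page)
    return {word: sorted(pages) for word, pages in index.items()}

# Hypothetical raw data: for each page, the keywords found on it.
raw_data = {
    "example.com/a": ["crawler", "index"],
    "example.com/b": ["index", "search"],
}

reverse_index = build_reverse_index(raw_data)
print(reverse_index["index"])  # ['example.com/a', 'example.com/b']
```

With the raw data you pass over every page once and can evaluate many queries during that single pass. With the reverse index, a single keyword lookup goes straight to the matching pages.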

My crawler currently outputs about 40-45 GB of raw data per day (about 30 million pages). The full crawl will be 2 billion pages, updated every 2-3 months.

The reverse-word-index would be about 18 GB per day for the same number of pages.

The reverse-word-index is already compressed; the raw data isn't.
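As a rough sanity check on those figures, the back-of-the-envelope arithmetic can be sketched as below. It assumes decimal gigabytes (1 GB = 10^9 bytes) and a constant crawl rate, neither of which is stated above, so treat the results as estimates only.

```python
PAGES_PER_DAY = 30e6        # about 30 million pages per day
FULL_CRAWL_PAGES = 2e9      # planned full crawl of 2 billion pages
RAW_GB_PER_DAY = (40, 45)   # raw data, uncompressed
INDEX_GB_PER_DAY = 18       # reverse-word-index, compressed
GB = 1e9                    # assumption: decimal gigabytes

# Days needed for one full crawl at the current rate.
days_for_full_crawl = FULL_CRAWL_PAGES / PAGES_PER_DAY

# Average bytes per page in each format.
raw_bytes_per_page = tuple(gb * GB / PAGES_PER_DAY for gb in RAW_GB_PER_DAY)
index_bytes_per_page = INDEX_GB_PER_DAY * GB / PAGES_PER_DAY

# Total size of one full crawl in terabytes (1 TB = 1000 GB).
raw_total_tb = tuple(gb * days_for_full_crawl / 1000 for gb in RAW_GB_PER_DAY)
index_total_tb = INDEX_GB_PER_DAY * days_for_full_crawl / 1000

print(f"full crawl: about {days_for_full_crawl:.0f} days")
print(f"raw data: about {raw_bytes_per_page[0]:.0f}-{raw_bytes_per_page[1]:.0f} bytes per page")
print(f"reverse index: about {index_bytes_per_page:.0f} bytes per page")
print(f"one full crawl: about {raw_total_tb[0]:.1f}-{raw_total_tb[1]:.1f} TB raw, {index_total_tb:.1f} TB index")
```

At the current rate a full pass takes a little over two months, which lines up with the 2-3 month update cycle.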

There is a small problem with the crawl, though: it does not always handle non-ASCII characters on pages correctly. I'm working on that.

BTW: I also currently have a list of about 8.5 billion URLs from the crawl, about 600 GB uncompressed. These are the links found on the crawled pages. Obviously not all of those will end up being crawled.