by deusu
3753 days ago
I'm always open to new business opportunities. :)

What would be more useful to you: the raw data (for each page, a list of the keywords on it) or the reverse-word-index? Raw data may be better for batch processing or for running multiple queries at the same time.

My crawler currently outputs about 40-45 GB of raw data per day (about 30 million pages). The full crawl will be 2bn pages, updated every 2-3 months. The reverse-word-index would be about 18 GB per day for the same number of pages. The reverse-word-index is already compressed; the raw data isn't.

There is a small problem with the crawl, though: it does not always handle non-ASCII characters on pages correctly. I'm working on that.

BTW: I also currently have a list of about 8.5bn URLs from the crawl, about 600 GB uncompressed. These are the links found on the crawled pages. Obviously not all of them will end up being crawled.
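To illustrate the difference between the two formats, here is a minimal sketch in Python. It is not the crawler's actual on-disk format, which isn't described here: the `raw_pages` dictionary, the example URLs, and the `build_reverse_index` function are all made up for illustration. It only shows the general idea that raw data maps each page to its keywords, while a reverse-word-index maps each keyword to the pages containing it.

```python
from collections import defaultdict


def build_reverse_index(raw_pages):
    """Turn {url: [keywords]} into {keyword: sorted list of urls}.

    Hypothetical helper: the real data layout is an assumption.
    """
    index = defaultdict(set)
    for url, keywords in raw_pages.items():
        for word in keywords:
            # A set avoids listing a URL twice if a keyword repeats on a page.
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Made-up "raw data": for each page, the list of keywords on it.
raw_pages = {
    "http://example.com/a": ["search", "engine", "crawler"],
    "http://example.com/b": ["crawler", "index"],
}

reverse_index = build_reverse_index(raw_pages)
```

With raw data you can scan every page once and answer many questions in a single batch pass; with the reverse index, a single keyword lookup such as `reverse_index["crawler"]` returns the matching pages directly.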