by deusu 3752 days ago
(I'm not with CommonSearch. I have my own project that crawls extensively though.)

You do realize that you are talking about potentially a LOT of data?

To give you an example: The word "work" occurs on about 4% of all web pages. So even if there were only about 2bn pages in an index, that would mean 80 million matching pages. Even if you only need their URLs, that would be about 2.4 GB of data, assuming an average URL length of 30 bytes. OK, compression can make that smaller, but still...
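To make that arithmetic easy to redo with other numbers, here is a minimal Python sketch. The function names are made up for illustration; the inputs are just the figures quoted above (2bn pages, 4% term frequency, 30 bytes per URL), and GB is taken as 10^9 bytes.

```python
# Back-of-the-envelope sizing for the result set of a single keyword.
# All inputs are the figures quoted in the comment; nothing is measured here.

def matching_pages(index_pages: int, term_fraction: float) -> int:
    """Number of pages expected to contain the term."""
    return round(index_pages * term_fraction)


def url_payload_bytes(pages: int, avg_url_bytes: int) -> int:
    """Uncompressed size of the URL list for those pages."""
    return pages * avg_url_bytes


if __name__ == "__main__":
    pages = matching_pages(2_000_000_000, 0.04)
    size = url_payload_bytes(pages, 30)
    # Decimal gigabytes (1 GB = 10**9 bytes)
    print(f"{pages:,} matching pages, {size / 1e9:.1f} GB of URLs")
```

Swap in a rarer term (say 0.1%) and the result drops to a few tens of MB, which is why the cost depends so much on the query.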

It would also mean that the server would need to make 80 million random reads to get the URLs. Even with SSDs that would take some time. Hmm, actually in this case it may be faster to just read all the URL data sequentially than to do random reads. But in both cases we would be talking about minutes needed to get all that data from disk.
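A rough way to compare the two approaches, as a sketch only. The hardware numbers below (100k random-read IOPS, 500 MB/s sequential throughput) are assumptions picked for illustration and do not come from any real measurement of this setup. It also assumes one random read per URL and that the full URL store is 2bn URLs at 30 bytes each.

```python
# Rough comparison: one random read per matching URL vs. scanning all URL data.
# ASSUMPTION: the IOPS and throughput values are illustrative, not measured.

ASSUMED_RANDOM_IOPS = 100_000      # hypothetical SSD random-read rate
ASSUMED_SEQ_BYTES_PER_S = 500e6    # hypothetical sequential read rate


def random_read_seconds(lookups: int, iops: float) -> float:
    """Time to do `lookups` independent random reads at `iops` reads/second."""
    return lookups / iops


def sequential_scan_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Time to stream `total_bytes` from disk at a steady throughput."""
    return total_bytes / bytes_per_second


if __name__ == "__main__":
    matches = 80_000_000                 # pages containing the term
    all_url_bytes = 2_000_000_000 * 30   # every URL in a 2bn-page index

    rand_s = random_read_seconds(matches, ASSUMED_RANDOM_IOPS)
    seq_s = sequential_scan_seconds(all_url_bytes, ASSUMED_SEQ_BYTES_PER_S)
    print(f"random reads:    {rand_s / 60:.1f} min")
    print(f"sequential scan: {seq_s / 60:.1f} min")
```

With these assumed numbers the full sequential scan comes out ahead, and both land in the range of minutes. Different hardware will shift the crossover point.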

I currently have a search-index with about 1.2bn pages - I expect to reach 2bn pages by mid-May - that could be used to get the kind of data you need. But not in a realtime API. Not that amount of result-data.

2 comments

1) I'd be very interested in such a service. 2) Yep, it's a lot, but that query is quite a lot bigger than most. Assuming some constraints on the layout of the index, I estimate you'd spend roughly $70 plus taxes and compute time retrieving the indexed documents from S3 for that query. You'd always be able to reduce or expand the keywords and retrieve only as much as you could afford. I think there's value in both allowing people to tackle querying the index by themselves and providing a paid-for managed service that automates much of that.
Interesting. To be honest, a static data set would be perfectly fine for a first batch processing attempt.

> that could be used to get the kind of data you need.

Cool. Would you be interested in sharing or exchanging data?

I'm always open to new business opportunities. :)

What would be more useful to you, the raw data - meaning for each page a list of the keywords on it - or the reverse-word-index?

Raw-data may be better for batch-processing or running multiple queries at the same time.
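For anyone unsure about the difference between the two formats, here is a toy sketch in Python. It is not the actual on-disk format of either data set; the URLs and keywords are invented, and `invert` and `batch_count` are hypothetical helper names.

```python
# Toy illustration of "raw data" (page -> keywords) versus a
# "reverse-word-index" (keyword -> pages). Not the real file formats.
from collections import defaultdict


def invert(raw: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build keyword -> set of URLs from URL -> set of keywords."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, keywords in raw.items():
        for keyword in keywords:
            index[keyword].add(url)
    return dict(index)


def batch_count(raw: dict[str, set[str]], queries: list[str]) -> dict[str, int]:
    """Answer many keyword queries in a single pass over the raw data."""
    counts = {q: 0 for q in queries}
    for keywords in raw.values():
        for q in queries:
            if q in keywords:
                counts[q] += 1
    return counts


# Made-up sample data
raw_data = {
    "http://example.com/a": {"work", "search"},
    "http://example.com/b": {"work", "crawler"},
    "http://example.com/c": {"index"},
}

if __name__ == "__main__":
    reverse_index = invert(raw_data)
    print(sorted(reverse_index["work"]))            # single-keyword lookup
    print(batch_count(raw_data, ["work", "index"]))  # many queries, one scan
```

The reverse index answers one keyword lookup directly, while the raw form lets a batch job evaluate any number of queries during one sequential scan.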

My crawler currently outputs about 40-45 GB of raw data per day (about 30 million pages). A full crawl will be 2bn pages, updated every 2-3 months.

The reverse-word-index would be about 18 GB per day for the same number of pages.

Reverse-word-index is already compressed, raw-data isn't.
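To put those daily figures in per-page terms, a quick sketch follows. It uses only the numbers quoted above, takes GB as 10^9 bytes (an assumption), and the function names are made up for illustration.

```python
# Per-page sizes and full-crawl duration derived from the quoted daily figures.
# ASSUMPTION: "GB" means 10**9 bytes.

PAGES_PER_DAY = 30_000_000
FULL_CRAWL_PAGES = 2_000_000_000


def days_for_full_crawl(total_pages: int, pages_per_day: int) -> float:
    """Days needed to crawl `total_pages` at a constant daily rate."""
    return total_pages / pages_per_day


def bytes_per_page(daily_bytes: float, pages_per_day: int) -> float:
    """Average output size per crawled page."""
    return daily_bytes / pages_per_day


if __name__ == "__main__":
    days = days_for_full_crawl(FULL_CRAWL_PAGES, PAGES_PER_DAY)
    raw_low = bytes_per_page(40e9, PAGES_PER_DAY)
    raw_high = bytes_per_page(45e9, PAGES_PER_DAY)
    index = bytes_per_page(18e9, PAGES_PER_DAY)
    print(f"full crawl: about {days:.0f} days")
    print(f"raw data: {raw_low:.0f}-{raw_high:.0f} bytes/page (uncompressed)")
    print(f"reverse index: {index:.0f} bytes/page (compressed)")
```

At a steady 30 million pages per day, 2bn pages takes a bit over two months, which lines up with the 2-3 month update cycle.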

There is a small problem with the crawl though, as it does not always handle non-ASCII characters on pages correctly. I'm working on that.

BTW: I also currently have a list of about 8.5bn URLs from the crawl. About 600 GB uncompressed. These are the links on the crawled pages. Obviously not all of those will end up being crawled.