by deusu
3752 days ago
(I'm not with CommonSearch. I have my own project that crawls extensively, though.)

You do realize that you are talking about potentially a LOT of data? To give you an example: the word "work" occurs on about 4% of all web pages. So even if there were only about 2bn pages in an index, that would mean 80 million matching pages. Even if you only need their URLs, that would be about 2.4 GB of data, assuming an average URL length of 30 bytes. OK, compression can make that smaller, but still...

It would also mean that the server would need to make 80 million random reads to get the URLs. Even with SSDs that would take some time. Hmm, actually in this case it may be faster to just read all the URL data sequentially than to do random reads. But in both cases we would be talking about minutes to get all that data from disk.

I currently have a search index with about 1.2bn pages (I expect to reach 2bn pages by mid-May) that could be used to get the kind of data you need. But not through a realtime API. Not that amount of result data.
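For anyone who wants to redo the arithmetic with their own numbers, here is a rough back-of-envelope sketch. The index size, match share and URL length are the figures from the comment; the random-read rate and sequential throughput are purely assumed placeholder values for a single SSD, not measurements, so treat the timings as order-of-magnitude only.

```python
# Back-of-envelope estimate for returning every URL that matches a common word.
# Figures marked "assumed" are illustrative and do not come from the comment.

INDEX_PAGES = 2_000_000_000        # index size used in the comment
MATCH_PERCENT = 4                  # share of pages containing "work"
AVG_URL_BYTES = 30                 # average URL length from the comment

RANDOM_READS_PER_SEC = 100_000            # assumed SSD random-read rate
SEQUENTIAL_BYTES_PER_SEC = 500_000_000    # assumed sequential throughput


def estimate():
    """Return match count, result size, and rough read times in seconds."""
    matches = INDEX_PAGES * MATCH_PERCENT // 100
    result_bytes = matches * AVG_URL_BYTES
    # Option 1: one random read per matching URL.
    random_read_seconds = matches / RANDOM_READS_PER_SEC
    # Option 2: scan the URL data of the whole index sequentially.
    sequential_scan_seconds = INDEX_PAGES * AVG_URL_BYTES / SEQUENTIAL_BYTES_PER_SEC
    return {
        "matches": matches,
        "result_bytes": result_bytes,
        "random_read_seconds": random_read_seconds,
        "sequential_scan_seconds": sequential_scan_seconds,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

With these assumed rates, the random-read path comes out at roughly 13 minutes and the full sequential scan at roughly 2 minutes, which is in line with the "minutes" estimate above. Different hardware will shift both numbers.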