|
The trick is that the list of URLs for each word already is in the URLs file. The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence. Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8. The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8. If you search for hello world, it will first pick a range, the range for "hello", let's say (1,5,7); and then do binary searches in the second range -- the range corresponding to "world" -- (2,5,8) to find the overlap. This seems like it would be very slow, but since you can trivially find the size of the ranges, it's possible to always do them in an order of increasing range-sizes. 10 x log(100000) is a lot smaller than 100000 x log(10) |
Funny I also selected "hello" and "world" above! Xo
My system is also written in Java btw!
Here are example results of my word search:
http://root.rupy.se/node/data/word/four
http://root.rupy.se/node/data/word/only
etc.