by marginalia_nu 1416 days ago
Might just be to reduce the size of the index.

Not sure if they still do, but they used to map keywords to integer identifiers instead of storing the actual string value (indices keyed on strings get very big). Page and Brin explain it themselves here[1]. I do the same in my search engine.
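
To make the keyword-to-integer mapping concrete, here is a minimal sketch in Python. It is not Google's implementation or the one in my search engine; the names Lexicon, word_id and index_documents are made up for illustration, and the tokenizer is just a lowercase whitespace split.

```python
class Lexicon:
    """Hypothetical sketch: assigns each distinct keyword a small integer ID."""

    def __init__(self):
        self._ids = {}    # keyword -> integer ID
        self._words = []  # integer ID -> keyword

    def word_id(self, word):
        """Return the ID for a word, allocating the next free ID if it is new."""
        wid = self._ids.get(word)
        if wid is None:
            wid = len(self._words)
            self._ids[word] = wid
            self._words.append(word)
        return wid

    def word(self, wid):
        """Reverse lookup: integer ID back to the keyword string."""
        return self._words[wid]


def index_documents(docs, lexicon):
    """Build an inverted index keyed by word ID instead of by string.

    docs is assumed to be a dict of {doc_id: text}.
    """
    index = {}
    for doc_id, text in docs.items():
        # Each distinct token in a document contributes one posting.
        for token in set(text.lower().split()):
            index.setdefault(lexicon.word_id(token), []).append(doc_id)
    return index
```

The string is stored once in the lexicon, and every posting list refers to it by a fixed-width integer, which is where the space saving comes from.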

The problem is that there are a lot of junk identifiers, so there's a point to shrinking the lexicon by eliminating probable noise keywords that are unlikely to ever be relevant to any search. UUIDs and hashes would probably fall into that category: they have a very large namespace and can very easily gunk up the lexicon with words that are never going to be relevant. You'd probably want to keep the word identifier at 32 bits if you can get away with it, but maybe 64 bits for a global search engine like Google.
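
A rough sketch of what such a noise filter could look like, again in Python. The patterns are my own guesses at a reasonable heuristic (canonical UUIDs, plus hex strings of 32 or more characters, which is MD5 length and up), not a description of what any real search engine does. is_probable_noise and fits_in_32_bits are hypothetical names.

```python
import re

# Canonical 8-4-4-4-12 UUID layout, e.g. 123e4567-e89b-12d3-a456-426614174000
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

# Assumed threshold: 32+ hex characters looks like a digest, not a word.
HEX_DIGEST_RE = re.compile(r"[0-9a-f]{32,}", re.IGNORECASE)


def is_probable_noise(token):
    """Heuristic: True if the whole token looks like a UUID or a hex digest."""
    return bool(UUID_RE.fullmatch(token) or HEX_DIGEST_RE.fullmatch(token))


def fits_in_32_bits(lexicon_size):
    """True if IDs 0 .. lexicon_size - 1 all fit in an unsigned 32-bit integer."""
    return lexicon_size <= 2**32
```

Tokens that trip the filter would simply never be given a word ID, so they don't eat into the roughly 4.29 billion identifiers a 32-bit ID allows.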

[1] http://infolab.stanford.edu/~backrub/google.html (section 4.2.4)

1 comment

I really enjoy reading your comments on HN. I found your search engine on Gemini, and since then I've seen you post about search engines every now and then.

You have a great ability to break things down in a way that makes sense.

Hey, thanks man.

I don't feel like I do a very good job at explaining things for the most part, but maybe that's not a very reliable indicator of whether what I write makes sense. :P