by inciampati 1524 days ago
I doubt that a succinct full text index (like an FM-index) of this data would require more than a modest server to keep in memory. Why aren't these used in this context?

But we're talking about petabytes of text comments, and an index of that would be a lot larger. How do you access that data fast enough to enable search?
Succinct full-text indexes can be substantially smaller than the source text. How much smaller depends on the empirical entropy of the text (zero-order for the simplest implementations, higher-order for better-compressed ones). If the text is highly repetitive, a very small index might be feasible. Lookup time is usually linear in the query length, with logarithmic factors in the database size.
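To make the lookup-cost point concrete, here is a minimal sketch of FM-index backward search in Python. It is a toy under stated assumptions: the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so this version is not succinct at all. Real implementations replace that table with compressed structures such as wavelet trees. The function names `build_fm_index` and `count` are made up for illustration.

```python
def build_fm_index(text):
    """Build a toy FM-index. Naive construction, only suitable for small inputs."""
    text += "\0"  # sentinel; assumed absent from the input, sorts before all other chars
    # Suffix array by brute-force sorting of suffixes (illustration only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    # C[c] = number of characters in the text strictly smaller than c.
    freq = {}
    for ch in text:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    # occ[c][i] = occurrences of c in bwt[:i]. Stored uncompressed here;
    # this table is what a succinct index would compress.
    occ = {ch: [0] * (len(bwt) + 1) for ch in freq}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return C, occ, len(bwt)


def count(index, pattern):
    """Count occurrences of pattern with one backward-search step per character."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix read so far
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))
```

The loop in `count` runs once per pattern character regardless of how long the indexed text is. In a compressed index each `occ` lookup is a rank query on a wavelet tree or similar structure, which is where the extra logarithmic factors come from.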
I've yet to see such a system in production, except for Sonic, but Sonic doesn't allow full-text search, only search on a key-by-key basis.