| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jltsiren 1641 days ago

There was some interest in the information retrieval research community 10-15 years ago, but I don't think anyone ever found a good application for it. Some limitations of the BWT always got in the way.

The BWT sees strings as integer sequences. Either "ABC" and "abc" are two unrelated strings, or you normalize before building the index and lose the ability to distinguish between the two.

Search proceeds character-by-character backwards, jumping arbitrarily around the BWT using the same LF-mapping function as when inverting the BWT. You get cache misses for every character.

BWT construction is expensive, because you want a single BWT for the entire string collection. There is a ridiculous number of papers on BWT construction, as well as on updating and merging existing BWTs, but the problem has still not been solved adequately. If your data is measured in gigabytes, you can just pay the price and build the index, but a few terabytes seems to be the practical upper limit for the current approaches.

You can of course partition the data and build multiple indexes, but then you have to search for each pattern in each index. There is no way to partition the data in a way that different indexes would be responsible for different queries.

1 comments

twotwotwo 1641 days ago

All interesting! Thank you.

link