Hacker News new | ask | show | jobs
by nialo 3330 days ago
They solve different problems. In particular, ripgrep and friends are designed to search arbitrary and potentially large directories with no pre-computation. They run in time ~linear in the size of the files to be searched.

The paper under discussion here is about a new way to create an index that also takes time ~linear in the size of the files to be searched, although presumably with a higher constant factor than just searching those files. After you have the index it's possible to search it in time linear in the length of the query rather than the files. This is much faster, but requires storing an index that is at least as large as the original file set, and keeping it up to date as things are changed etc.

1 comments

Can you retrieve the original file from the index? If yes it's maybe interesting to store files like that for some databases by default and retrieve the original file only when needed.
I haven't read this paper, and haven't worked with the compressed variety of suffix trees/arrays. That said, I'm confident it's possible to retrieve the original file from a normal suffix tree, although it would be pretty slow. I imagine it must be possible to retrieve the file from the compressed version as well.

If nothing else, I think it is possible to retrieve a complete file from any index that lets you search for substrings and get a full string and position in the file back. Just search for all length 1 substrings, get a map from position -> string, then reconstruct the file from that.

I doubt it's worth storing files this way, because turning the index back into the file sounds very slow. I'd rather just store the file and the index, the time v. space tradeoff seems like a good one if you really care about search performance. That said, I use The Silver Searcher, and it's fast enough with no index that I don't think any of this stuff is worth the effort for searching text files on a file system.