Hacker News new | ask | show | jobs
by atombender 3261 days ago
Last I checked, Lucene doesn't use B-trees. It uses a variation on skip lists to implement efficient inverted indices.

As I understand it, its performance comes from a column-store-like approach where large document lists can be efficiently intersected. Lucene also uses a LSM-style approach when writing index files; instead of updating a single large index in place, it incrementally appends to new partial index files, and to execute a query it queries every partial index and merges the results. (These partial index files are then incrementally merged in the background to reduce the number of files that need to be aggregated at query time.)

Lucene is slower than an RDBMS on narrowly targeted queries where an RDBMS can use a B-tree, but is more efficient when a query matches many documents because its index data structures are more efficient than a classical "sequential scan" across row pages.

1 comments

Not sure I understand your reply - which of it explains how NitroGen (i.e. the indexing algorithm described in the linked paper) has evolved?
I was responding to the assertion that Lucene uses B-trees: "The precise reason for using b-tree(-ish) data structures in databases and search engines like Lucene is..."
Oh, that's why I had mentioned tree-ish structures, sort of referring to skip-lists & friends on a broad level. Sorry for the misleading expression. Rather, I'm interested if any ranged/set retrieval approach can be found that scales better than log n. (n.b. ranged, so LSH doesn't count either ;-))