| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by atombender 3308 days ago

Last I checked, Lucene doesn't use B-trees. It uses a variation on skip lists to implement efficient inverted indices.

As I understand it, its performance comes from a column-store-like approach where large document lists can be efficiently intersected. Lucene also uses a LSM-style approach when writing index files; instead of updating a single large index in place, it incrementally appends to new partial index files, and to execute a query it queries every partial index and merges the results. (These partial index files are then incrementally merged in the background to reduce the number of files that need to be aggregated at query time.)

Lucene is slower than an RDBMS on narrowly targeted queries where an RDBMS can use a B-tree, but is more efficient when a query matches many documents because its index data structures are more efficient than a classical "sequential scan" across row pages.

1 comments

fnl 3308 days ago

Not sure I understand your reply - which of it explains how NitroGen (i.e. the indexing algorithm described in the linked paper) has evolved?

link

lobster_johnson 3308 days ago

I was responding to the assertion that Lucene uses B-trees: "The precise reason for using b-tree(-ish) data structures in databases and search engines like Lucene is..."

link

fnl 3308 days ago

Oh, that's why I had mentioned tree-ish structures, sort of referring to skip-lists & friends on a broad level. Sorry for the misleading expression. Rather, I'm interested if any ranged/set retrieval approach can be found that scales better than log n. (n.b. ranged, so LSH doesn't count either ;-))

link