| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mutex007 3613 days ago
	I am curious how an inverted index will be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key and when performing an intersection operation between two lists, you interate using a range scan. Could work but i wonder how fast Read operations will be. Anybody attempted yet?

3 comments

btrask 3613 days ago

LSM trees are ideal for posting lists, because every time you add a new document, you need to touch potentially hundreds of different lists. LSM trees are able to batch all of these writes together. That's why Google originally developed BigTable, AFAIK.

link

teraflop 3613 days ago

In fact, Lucene (the library behind Solr and Elasticsearch) is basically a special-purpose inverted-index-as-LSM-tree database engine.

link

abecedarius 3613 days ago

https://github.com/kragen/500lines/tree/master/search-engine explains a simple version of this in Python.

link

Arkanosis 3613 days ago

Done this years ago, still being used in production. In addition to the inverted lists, we have a bitmap per level, which tells whether an id is present in that level. All bitmaps are merged in memory in a structure mapping each id to the last level it is valid in. Then, lists of different levels are merged using a binary min heap. Entries that are not valid in the map are kept off the merge heap.

link