Hacker News new | ask | show | jobs
by mutex007 3613 days ago
I am curious how an inverted index will be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key and when performing an intersection operation between two lists, you interate using a range scan. Could work but i wonder how fast Read operations will be. Anybody attempted yet?
3 comments

LSM trees are ideal for posting lists, because every time you add a new document, you need to touch potentially hundreds of different lists. LSM trees are able to batch all of these writes together. That's why Google originally developed BigTable, AFAIK.
In fact, Lucene (the library behind Solr and Elasticsearch) is basically a special-purpose inverted-index-as-LSM-tree database engine.
https://github.com/kragen/500lines/tree/master/search-engine explains a simple version of this in Python.
Done this years ago, still being used in production. In addition to the inverted lists, we have a bitmap per level, which tells whether an id is present in that level. All bitmaps are merged in memory in a structure mapping each id to the last level it is valid in. Then, lists of different levels are merged using a binary min heap. Entries that are not valid in the map are kept off the merge heap.