|
|
|
|
|
by SeanSullivan86
324 days ago
|
|
I've sometimes been confused by the term "inverted index". The example in this post feels like what I would just call an "index"... i.e documents indexed by the words they contain. Feels about the same as the index in the back of a physical book. Is the distinction that an index on a multi-valued attribute is called an inverted index? |
|
But when we talk about inverted indexes, they are almost always term -> posting list, and most index data structures lay these out so that posting lists are sorted and compressed together. Traditional database indexes like B-trees are optimized for rapid insertion and deletion, while inverted indexes tend to be optimized for batch processing, because you typically deconstruct text into words for a large batch and then lazily integrate this batch into the main index.
Part of this is about scale; a row in a database typically has a single column or maybe 2-3 columns in a composite index; but a document text may tokenize into thousands, hundreds of thousands, or millions of words. At this scale, the fine-grained nature of words mean B-trees aren't as a good a fit.
Another part of it is that inverted indexes aren't for point queries, which is what B-trees are optimized for; you typically search for many words at a time in order to rank your search results by some function like cosine similarity. You rarely want a single posting; you want the union or intersection of many posting sorted by score.