Oof, this may take a much longer blog post, but here is the very high level basic view.

The basic construction on one doc-sharded server looks like: 1) Maximum valid local_docid 2) A map of local_docid => state (valid, deleted) 3) A map of token_id (indexed term) => map of local_docids to positions in doc.

On document update, you take the next local docid (one past (1)). You then rip through the doc and extract the tokens. For each token, you insert the (docid, position) pair into map (3). Then you add the document to map (2) with state "valid", and finally increment (1), which is what makes the new doc visible to queries.

On query, you first copy (1), then do the typical AND/OR retrieval over (3). Any docids seen higher than (1) are ignored, and any docs retrieved are then filtered by (2).
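To make the update and query steps concrete, here is a rough single-threaded Python sketch. It is a toy model and not how a real server would do it: the class and method names are made up, tokens are plain whitespace-split strings standing in for token_ids, and ordinary dicts stand in for the concurrent maps. Only the ordering of the steps is meant to match the description above.

```python
from collections import defaultdict

VALID, DELETED = "valid", "deleted"


class ShardIndex:
    """Toy model of one doc-sharded server (hypothetical names throughout)."""

    def __init__(self):
        self.max_docid = 0                 # (1) highest published local_docid
        self.state = {}                    # (2) local_docid -> VALID / DELETED
        self.postings = defaultdict(dict)  # (3) token -> {local_docid: [positions]}

    def add(self, text):
        docid = self.max_docid + 1         # next id, not yet visible to queries
        for pos, tok in enumerate(text.lower().split()):
            self.postings[tok].setdefault(docid, []).append(pos)
        self.state[docid] = VALID
        self.max_docid = docid             # publish last
        return docid

    def delete(self, docid):
        self.state[docid] = DELETED        # postings stay until compaction

    def search_and(self, *tokens):
        limit = self.max_docid             # copy (1) before touching (3)
        lists = [self.postings.get(t, {}) for t in tokens]
        if not lists:
            return []
        hits = set(lists[0])
        for plist in lists[1:]:
            hits &= set(plist)
        # Ignore docids above the snapshot, then filter by (2).
        return sorted(d for d in hits
                      if d <= limit and self.state.get(d) == VALID)
```

The point of bumping (1) last is that a query which sees a half-indexed doc in (3) throws it away, because its docid is above the snapshot taken at the start of the query.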

In this model, (1) is a volatile memory access. (2) and (3) are very similar to this "relativistic hash map".

Deletions are complicated, and usually you filter out invalid docids from (3) as a background compaction process.
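A minimal sketch of what that compaction pass could look like, again in Python with plain dicts. The function name `compact` and the build-a-new-map approach are assumptions for illustration; a real system would have to coordinate this with concurrent readers and writers.

```python
def compact(postings, state):
    """Return a new postings map without entries for deleted docids.

    postings: token -> {local_docid: [positions]}   (map 3)
    state:    local_docid -> "valid" / "deleted"    (map 2)
    Docids with no state yet (still being indexed) are kept.
    """
    compacted = {}
    for token, plist in postings.items():
        kept = {d: pos for d, pos in plist.items()
                if state.get(d) != "deleted"}
        if kept:                           # drop tokens with no live docs left
            compacted[token] = kept
    return compacted
```

The input map is left untouched, so in-flight queries can keep reading the old version until the new one is swapped in.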