| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by csense 549 days ago

I'm trying to follow the example and getting lost.

> Imagine the scenario where "mary had" occurs in the following positions: [1, 6] and "a" appears in the position [2], so "mary had a" occurs in the positions [2]

Okay so basically this means "mary had a" ends at token position 2 (assuming "mary" is token position 0), and you're trying to create an efficient algorithm to do this backwards linking process.

It's not entirely clear from the article what's pre-computed ahead of time (when documents are added / indexed by the system) and what's done on-the-fly (in response to a user's specific search query).

Based on skimming the article it appears that you're doing this backwards linking process on the fly for a specific phrase the user enters (but I could be wrong about that).

> allows us to decompose this value

What value is being decomposed? It is not clear what "this value" refers to.

> one representing the group and the other the value

Ctrl+F is telling me that's the first occurrence of the word "group" in this post. What is the "group" being represented?

> pos = group * 16 + value

Okay so you're bit-packing two fields into pos. One of the things being packed is called "group," and it seems to be 16 bits. The other thing being packed is "value," and it seems to be 4 bits. So in total "pos" has 20 bits.

> the maximum document length is 1048576 tokens

It seems that "pos" simultaneously corresponds to a token position and our two-field bit-packed thing above. I can't figure out how these two things are in one-to-one correspondence.

I stopped reading there, given my confusion so far it seems unlikely I'll really be able to really understand much of what follows.

PS: I skimmed the article on roaring bitmaps. Seems they're basically bitmaps where each 1k (8192-bit) chunk has a storage format (dense, sparse, or RLE), and algorithms for quickly doing intersection, union, etc. with various optimized cases chosen roughly by the number of 1's. (Intersecting two dense bitmaps with say ~50% randomly distributed 1's you can't get faster than ANDing together the bit vectors. But if you're, say, intersecting a small sparse bitmap with ~5 1's against a big dense bitmap with ~4k 1's you can iterate over the 1's in the sparse bitmap checking whether each 1 is in the big bitmap.)

So in my mind I'm basically just blackboxing roaring bitmaps as "java.util.BitSet or vector<bool> with some extra optimizations that kick in if your data has sections where most of the bits are the same".