Hacker News new | ask | show | jobs
by matt4711 3213 days ago
I was at the SIGIR'17 presentation of this paper (won best paper award btw) and have some comments in general:

- They mentioned (from what I remember) that they now use BitFunnel as they core of the complete Bing search engine not just the fresh parts.

- When I read the paper and looking at the code, it looks like their index doesn't include frequency information whereas your PEF code does. It is unclear what was counted in the experiments.

- If you look at the code, they are actually doing much more complicated stuff than just regular bloom filters by "bin packing" the hash positions for each term to reduce false positive rates (see https://github.com/BitFunnel/BitFunnel/issues/278 ). I'm nor sure if it is "fair" to compare a system developed by 10+ engineers over many years to a "phd student" code base developed over short period of time. I think the PEF code is excellent but I'm more talking about that engineering efforts can have a large impact on performance.

- I'm fairly sure you are right regarding the lack of URL-sorting. However, this can have another cause. If you consider Figure 4 in the paper which shows how "higher ranking rows" group documents together to allow faster intersection. URL sorting causes clusters in document-ids. Say, in the example in Fig. 4 there might be a cluster for that specific term for documents 0,1,2,3. This would mean the "higher ranking" row approach becomes worse (more false positives) when clustering occurs in the collection. So while URL-sorting helps PEF, it will most likely make BitFunnel worse.

1 comments

> They mentioned (from what I remember) that they now use BitFunnel as they core of the complete Bing search engine not just the fresh parts.

I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.

> When I read the paper and looking at the code, it looks like their index doesn't include frequency information whereas your PEF code does. It is unclear what was counted in the experiments.

In PEF the frequencies are not interleaved with the postings, so if you don't read them you don't pay any computational overhead (they mention this in the paper). However, it's not clear whether they included them when measuring the space.

> I'm nor sure if it is "fair" to compare a system developed by 10+ engineers over many years to a "phd student" code base developed over short period of time.

I'm not trying to compare the code :) On the contrary, I'm mostly concerned about the behavior as the collection size grows. Gov2, especially if split into 5 pieces, is relatively small.

> So while URL-sorting helps PEF, it will most likely make BitFunnel worse.

That's possible, but I don't see why they could not use different docid orderings for BitFunnel and PEF. If they use the one that is better for BitFunnel, that's not fair to PEF.

> I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.

From looking at the github repo it does look like the system runs entirely in main memory.

Yes, I meant that I don't think they're holding an entire web index in RAM.