by ot 3213 days ago
> They mentioned (from what I remember) that they now use BitFunnel as the core of the complete Bing search engine, not just the fresh parts.

I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.

> When I read the paper and looked at the code, it looked like their index doesn't include frequency information, whereas your PEF code does. It is unclear what was counted in the experiments.

In PEF the frequencies are not interleaved with the postings, so if you don't read them you don't pay any computational overhead (they mention this in the paper). However, it's not clear whether they included them when measuring the space.
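To make the interleaving point concrete, here is a minimal sketch. It is not PEF's actual encoding (which is based on partitioned Elias-Fano); the two layouts, the function names, and the "values touched" counter are assumptions for illustration only. It also assumes a variable-length stream, where a frequency has to be decoded before the next docid can be located.

```python
# Hypothetical toy layouts, not the real PEF or BitFunnel formats.

def scan_interleaved(stream):
    """Collect docids from a flat [doc, freq, doc, freq, ...] stream.

    Assumes a variable-length encoding, so each freq must be decoded
    (touched) just to find where the next docid starts.
    """
    docids, touched = [], 0
    for i in range(0, len(stream), 2):
        docids.append(stream[i])
        touched += 2  # one docid plus the freq we had to step over
    return docids, touched


def scan_separate(docs, freqs):
    """Collect docids when freqs live in their own sequence.

    The freqs argument is deliberately never read.
    """
    return list(docs), len(docs)


postings = [(3, 2), (7, 1), (12, 5)]  # (docid, frequency) pairs
interleaved = [value for pair in postings for value in pair]
docs = [doc for doc, _ in postings]
freqs = [freq for _, freq in postings]

ids_a, touched_a = scan_interleaved(interleaved)
ids_b, touched_b = scan_separate(docs, freqs)
print(ids_a, touched_a)  # [3, 7, 12] 6
print(ids_b, touched_b)  # [3, 7, 12] 3
```

Both scans return the same docids, but the separate layout touches half as many values for a boolean query. The frequencies still take up space in the index either way, which is why it matters whether they were counted in the size measurements.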

> I'm not sure if it is "fair" to compare a system developed by 10+ engineers over many years to a "PhD student" code base developed over a short period of time.

I'm not trying to compare the code :) On the contrary, I'm mostly concerned about the behavior as the collection size grows. Gov2, especially if split into 5 pieces, is relatively small.

> So while URL-sorting helps PEF, it will most likely make BitFunnel worse.

That's possible, but I don't see why they could not use different docid orderings for BitFunnel and PEF. If they use the one that is better for BitFunnel, that's not fair to PEF.
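As a rough illustration of why the docid ordering matters for a compressed posting list, here is a toy sketch. The URLs, the crawl order, and the bit-length cost are all made up for the example, and the cost function is only a crude proxy for gap-based compression, not the Elias-Fano cost model that PEF actually uses.

```python
# Toy example with hypothetical URLs; the cost is a proxy, not PEF's real size.

def assign_docids(urls):
    """Give each URL a docid equal to its position in the given order."""
    return {url: i for i, url in enumerate(urls)}


def gap_cost_bits(docids):
    """Sum of the bit lengths of the gaps between consecutive sorted docids."""
    ordered = sorted(docids)
    gaps = [ordered[0] + 1] + [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gap.bit_length() for gap in gaps)


crawl_order = [
    "a.gov/x/1", "b.gov/y/1", "c.gov/z/1", "a.gov/x/2",
    "b.gov/y/2", "c.gov/z/2", "a.gov/x/3", "b.gov/y/3",
]
# Pretend some term occurs only on the a.gov pages.
matching = [url for url in crawl_order if url.startswith("a.gov/")]

by_crawl = assign_docids(crawl_order)
by_url = assign_docids(sorted(crawl_order))

cost_crawl = gap_cost_bits([by_crawl[url] for url in matching])
cost_url = gap_cost_bits([by_url[url] for url in matching])
print(cost_crawl, cost_url)  # 5 3
```

With URL-sorted docids the matching pages become adjacent (docids 0, 1, 2), so the gaps shrink. Whether the same ordering helps or hurts BitFunnel is a separate question, which is why each system should be measured under the ordering that suits it.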

1 comment

> I find it hard to believe this. Their main index is certainly not all-RAM (there must be some flash and maybe even disk), and the throughput would just not be enough for something like BitFunnel.

From looking at the GitHub repo, it does look like the system runs entirely in main memory.

Yes, I meant that I don't think they're holding an entire web index in RAM.