Hacker News new | ask | show | jobs
by eserorg 6069 days ago
You're absolutely right.

I hacked up a prototype last night that creates a reverse index of:

tags => item GUID's => a precomputed bloom filter for all tags in that GUID

Each tag has its own reverse-index, which is stored in a separate file assembled using O_APPEND. The entries in each file are pre-sorted based on a predefined ranking algorithm.

Queries are mapped to cluster nodes using a CRC32 hash modulo across the number of cluster nodes.

Tag conjunction queries are run by computing a bloom filter for the tags in the query. The tag with the fewest GUID's is determined (approximately) through a LUT stored in main-memory. The query bloom filter is then sequentially compared using a bitwise-and with each of the bloom filters in the appropriate reverse-index file.

After 10 matches are found, the results are rendered.

No check is done for false positives.

The problem I'm having now is in prepending new entries to the start of a reverse-index file. This operation is not performant. Since it's out-of-band with the query stream, it's tolerable for the time being.