Hacker News new | ask | show | jobs
by eserorg 6070 days ago
That would be very interesting to read.

We're looking at implementing a tagging system for navigating through a large proprietary datastore of oil and gas well data.

The problem we're running into is scaling conjunction queries without blowing-out our hardware budget -- 100's of tags per item with tens of millions of items, updated daily.

2 comments

You may try what happens with Redis server-side Set intersections. This could help in theory, at least if you have a way to sort the output data (that will be unsorted).

Basically you need a lot of RAM, and to put in keys called something like "tag:<tagid>" all the IDs of your products having such an ID.

Then if you want all the products IDs having as tags both foo, and bar, you ask Redis the following: SINTER tag:10 tag:20

Using SINTERSTORE is also possible to put the result into a new key, and then use SORT to sort this data.

The right data structure for tagging is probably an inverted index. Look at Lucene, Solr, or Sphinx for this.
You're absolutely right.

I hacked up a prototype last night that creates a reverse index of:

tags => item GUID's => a precomputed bloom filter for all tags in that GUID

Each tag has its own reverse-index, which is stored in a separate file assembled using O_APPEND. The entries in each file are pre-sorted based on a predefined ranking algorithm.

Queries are mapped to cluster nodes using a CRC32 hash modulo across the number of cluster nodes.

Tag conjunction queries are run by computing a bloom filter for the tags in the query. The tag with the fewest GUID's is determined (approximately) through a LUT stored in main-memory. The query bloom filter is then sequentially compared using a bitwise-and with each of the bloom filters in the appropriate reverse-index file.

After 10 matches are found, the results are rendered.

No check is done for false positives.

The problem I'm having now is in prepending new entries to the start of a reverse-index file. This operation is not performant. Since it's out-of-band with the query stream, it's tolerable for the time being.