Hacker News new | ask | show | jobs
by compare 4438 days ago
I'm about to deploy a new autocomplete on my site, probably with 10s to 100s of millions of records, at least 100 users at once. Would the new Redis commands help here? How would the memory usage be? Is it better to just use something else like Cleo for autocomplete?
2 comments

Can't reply on the comparison with other products since I myself have still to compare and build an experience about this, but as far as how Redis performs, there is a demo here: http://autocomplete.redis.io. Basically for 8 million records it takes 1GB of memory (32 bit system), however here the records are source code lines so the average length is bigger than the usual search-term length. Definitely no problems in the ~10 millions range even with just a few GBs of memory. For +100 millions you need to either split the range across servers or use a machine with some non-trivial amount of memory, like 16-32 GB or alike.
Also with that much data is sensible to set up redis in a master-slave fashion for better availability. Once you have that up and running you can split the work between the two, since autocomplete searches are intrinsically read operations.
Keep in mind that the current redis implemention doesn't include scores for auto completion, and assumes all the entries in a sorted set have the same score to be able to do autocomplete style search. How to sort is up to you but it might create big overhead in CPU and memory.
At the end of the original link of this post, I tried to hint a bit about a simple way to overcome this, which is by discrete splitting of different rates different search strings are typed.

Basically you have a sorted set that counts the occurrences of a given search string, using top-k stream algorithm consisting of just taking a set of size N2 if you are interested in the top-N, and always trimming a random element from the top N - N2 interval when a new element is added.

At this point what happens is that you take M autocompletion keys. complete_10, complete_100, complete_1000, ..., complete_max, that go up to order of magnitudes, so what happens is that when an user types you query complete_max, and check if it has enough entries, otherwise you also query the lower-rank sorted set and so forth.

This way you are able to show users top-searches while they are typing, but still populate the autocompletion with lower-frequency entries if needed.

Of course when the "counting" sorted set is updated, if an element goes from an order of magnitude to the other, you update the sorted sets accordingly.

Yep, that's a solid approach, although it might be limiting to some people. Also, maybe having something like ZRANGEBYLEXSTORE would also allow caching and sorting of the results by allowing quick ZINTER etc.
Agreed on both, the algorithm proposed is not perfect as it is discrete in nature, and storing results makes a lot of sense for top-queries since this is the kind of application where 5 minutes old is "good enough" for most users.