Hacker News new | ask | show | jobs
by paladin314159 3954 days ago
Redis is a great piece of software, and we leverage it for several uses cases outside of managing sets. For our use case, there were a couple of blockers that prevented Redis from being a viable solution:

1. It's tricky to scale out a Redis node when it gets too big. Because RDB files are just a single dump of all data, it's not easy to make a specific partitioning of the dataset. This was a very important requirement for us in order to ease scaling (redis-cluster wasn't ready yet -- we've been following that carefully).

2. When you store hundreds of GB of persistent data in Redis, the startup process can be very slow (restoring from RDB/AOF). Since it can't serve reads or writes during this time, you're unavailable (setting up a slave worsens the following problem).

3. The per-key overhead in Redis (http://stackoverflow.com/questions/10004565/redis-10x-more-m...). We have many billions of sets that are often only a few elements in size -- think of slicing data by city or device type -- which means that the resulting overhead can be larger than the dataset itself.

If you think about these problems upfront, they're not too difficult to solve for a specific use case (partition data on disk, allow reads from disk on startup), but Redis has to be generic and so can't leverage the optimizations we made.

1 comments

Hello Jeffrey, First I wanted to say that your post is very nicely written and full of juicy details! :)

Regarding the sets database, I had to solve quite a similar problem at the company where I work and instead of sets I actually chose to use the Redis HypeLogLog structure instead of sets because for near real time results you just need an approximate count of the sets / or their intersection and you don't need to know the specific set members. I just wanted to let you know that it works great for us for with doing intersections (PFMERGE) on sets containing hundreds of millions of members. If anybody is interested I can do a writeup about it.

Did you ever consider using that?

Thanks! We have considered using HLL, and it's a pretty cool algorithm.

For us, however, it's important to get the set members at the end of the day. Amplitude is unique from other analytics products in that we put a lot of emphasis on the actual users that correspond to a data point on a graph -- one of our key features, Microscope, is the ability to view those users, see more context around the events they are performing, and potentially create a dynamic cohort out of them. As such, approximations that don't allow us to get the set members don't quite satisfy our use case.

Sorry, I wanted to say that I am not actually not familiar with your product and was not aware of the feature in which you can create audiences for running ad campaigns. From the article I thought the sets were used mostly for real time analytics and that is why I started to talk about HLL.

If you do need the actual set members in real time then of course you can't use HLL :)

Hyperloglog is approximate. You also can't do set complement. Also can't get the ids. But other than that it's great!
I am not gonna suggest that one solution fits all. Maybe you have different requirements but in our case (adtech) offering near real time reporting with a standard error rate of 0.81% is very good.