| HN Mirror



	by jallmann 3954 days ago
	What shortcomings of Redis set operations does the in-memory data store address, and how? Unrelated rant: regardless of its merits, "Lambda" Architecture is probably the most annoying overloaded term in use today, second only to "Isomorphic" JavaScript. Just because something has a passing resemblance to the functional style doesn't grant license to re-appropriate a well-understood term of art.

2 comments

paladin314159 3954 days ago

Redis is a great piece of software, and we leverage it for several use cases outside of managing sets. For our use case, there were a couple of blockers that prevented Redis from being a viable solution:

1. It's tricky to scale out a Redis node when it gets too big. Because RDB files are just a single dump of all data, it's not easy to make a specific partitioning of the dataset. This was a very important requirement for us in order to ease scaling (redis-cluster wasn't ready yet -- we've been following that carefully).

2. When you store hundreds of GB of persistent data in Redis, the startup process can be very slow (restoring from RDB/AOF). Since it can't serve reads or writes during this time, you're unavailable (setting up a slave worsens the following problem).

3. The per-key overhead in Redis (http://stackoverflow.com/questions/10004565/redis-10x-more-m...). We have many billions of sets that are often only a few elements in size -- think of slicing data by city or device type -- which means that the resulting overhead can be larger than the dataset itself.

If you think about these problems upfront, they're not too difficult to solve for a specific use case (partition data on disk, allow reads from disk on startup), but Redis has to be generic and so can't leverage the optimizations we made.
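To make the "partition data on disk, allow reads from disk on startup" idea concrete, here is a minimal sketch. It is not Amplitude's implementation: the partition count, the JSON file layout, and the names `partition_for`, `write_partitions`, and `read_set` are all hypothetical, chosen only to show how hashing set keys into independent partition files lets a single partition be moved or read without loading a whole dump.

```python
import hashlib
import json
import os

NUM_PARTITIONS = 8  # hypothetical; a real system would size this for its data


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a set key to a stable partition index using a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_path(directory: str, index: int) -> str:
    """File name for one partition (hypothetical naming scheme)."""
    return os.path.join(directory, f"part-{index:04d}.json")


def write_partitions(sets: dict, directory: str,
                     num_partitions: int = NUM_PARTITIONS) -> None:
    """Persist each partition to its own file instead of one monolithic dump."""
    buckets = [dict() for _ in range(num_partitions)]
    for key, members in sets.items():
        buckets[partition_for(key, num_partitions)][key] = sorted(members)
    for index, bucket in enumerate(buckets):
        with open(partition_path(directory, index), "w", encoding="utf-8") as handle:
            json.dump(bucket, handle)


def read_set(key: str, directory: str,
             num_partitions: int = NUM_PARTITIONS) -> set:
    """Serve a read straight from disk by opening only the key's partition."""
    index = partition_for(key, num_partitions)
    with open(partition_path(directory, index), encoding="utf-8") as handle:
        return set(json.load(handle).get(key, []))
```

Because each key always lands in the same partition file, one partition can be copied to another node on its own, and a read can be answered from disk before the rest of the dataset has been loaded into memory.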


chupy 3953 days ago

Hello Jeffrey, First I wanted to say that your post is very nicely written and full of juicy details! :)

Regarding the sets database, I had to solve quite a similar problem at the company where I work, and I chose the Redis HyperLogLog structure instead of sets, because for near-real-time results you only need an approximate count of the sets or their intersection, not the specific set members. I just wanted to let you know that it works great for us when doing intersections (PFMERGE) on sets containing hundreds of millions of members. If anybody is interested I can do a writeup about it.

Did you ever consider using that?
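For readers wondering how intersections can come out of HyperLogLog at all: PFMERGE itself produces a sketch of the union, so one common way to estimate an intersection is inclusion-exclusion, |A ∩ B| ≈ |A| + |B| - |A ∪ B|. The comment does not say this is the exact method used, so treat the following as an assumption. The `HyperLogLog` class below is a simplified, self-contained stand-in written for illustration (no bias correction, SHA-1 hashing), not Redis's implementation; `merge` plays the role of PFMERGE and `count` the role of PFCOUNT.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (illustrative only, not Redis's code)."""

    def __init__(self, precision: int = 14):
        self.p = precision                # number of bits used for the register index
        self.m = 1 << precision           # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Union of two sketches: register-wise maximum."""
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        """Estimated number of distinct items added."""
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting for small sets
        return raw


def estimate_intersection(a: HyperLogLog, b: HyperLogLog) -> float:
    """Inclusion-exclusion: |A n B| ~ |A| + |B| - |A u B| (clamped at zero)."""
    return max(0.0, a.count() + b.count() - a.merge(b).count())
```

One caveat with this approach: the errors of the three estimates add up, so the relative error of the intersection grows when the overlap is small compared with the sets themselves.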


paladin314159 3953 days ago

Thanks! We have considered using HLL, and it's a pretty cool algorithm.

For us, however, it's important to get the set members at the end of the day. Amplitude is unique from other analytics products in that we put a lot of emphasis on the actual users that correspond to a data point on a graph -- one of our key features, Microscope, is the ability to view those users, see more context around the events they are performing, and potentially create a dynamic cohort out of them. As such, approximations that don't allow us to get the set members don't quite satisfy our use case.


chupy 3953 days ago

Sorry, I wanted to say that I am actually not familiar with your product and was not aware of the feature in which you can create audiences for running ad campaigns. From the article I thought the sets were used mostly for real-time analytics, and that is why I started to talk about HLL.

If you do need the actual set members in real time then of course you can't use HLL :)


ddorian43 3953 days ago

HyperLogLog is approximate. You also can't do set complement, and you can't get the ids. But other than that it's great!


chupy 3953 days ago

I am not gonna suggest that one solution fits all. Maybe you have different requirements, but in our case (adtech) offering near-real-time reporting with a standard error of 0.81% is very good.
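For context on where the 0.81% figure comes from: the standard error of a HyperLogLog estimate is approximately 1.04 / sqrt(m), where m is the number of registers, and Redis uses 16384 registers. A small sketch of that arithmetic follows; the function name `hll_standard_error` is made up for illustration.

```python
import math

REDIS_HLL_REGISTERS = 16384  # 2**14 registers in Redis's HyperLogLog


def hll_standard_error(registers: int) -> float:
    """Approximate relative standard error of a HyperLogLog estimate."""
    return 1.04 / math.sqrt(registers)


# 1.04 / sqrt(16384) = 1.04 / 128, i.e. roughly 0.81%
redis_error = hll_standard_error(REDIS_HLL_REGISTERS)
```

Since the error shrinks only with the square root of the register count, halving it would take four times as many registers.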


daddykotex 3954 days ago

I also wonder what trade-offs were made relative to the design used in Redis in order to get better performance.
