by chupy 3953 days ago
Hello Jeffrey, First I wanted to say that your post is very nicely written and full of juicy details! :)

Regarding the sets database, I had to solve quite a similar problem at the company where I work, and I chose the Redis HyperLogLog structure instead of sets, because for near real time results you just need an approximate count of the sets or their intersection and you don't need to know the specific set members. I just wanted to let you know that it works great for us for doing intersections (PFMERGE) on sets containing hundreds of millions of members. If anybody is interested I can do a writeup about it.
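
For anyone curious what that can look like, here is a minimal sketch. It is an assumption about the general approach, not a description of our actual setup: PFMERGE itself produces the union of the HLLs, so the intersection count is estimated with inclusion-exclusion, |A ∩ B| ≈ |A| + |B| - |A ∪ B|. The method names pfadd, pfmerge, pfcount and delete match redis-py, but `FakeHLLClient`, `estimate_intersection` and the key names are hypothetical, and the fake client uses exact Python sets so the example runs without a Redis server.

```python
class FakeHLLClient:
    """Hypothetical in-memory stand-in for a Redis client.

    It mimics the pfadd/pfmerge/pfcount/delete method names from redis-py,
    but stores exact Python sets instead of real HyperLogLog registers.
    """

    def __init__(self):
        self._data = {}

    def pfadd(self, name, *values):
        members = self._data.setdefault(name, set())
        before = len(members)
        members.update(values)
        return int(len(members) != before)

    def pfmerge(self, dest, *sources):
        # Like PFMERGE, an existing destination is merged in as well.
        merged = set(self._data.get(dest, set()))
        for src in sources:
            merged |= self._data.get(src, set())
        self._data[dest] = merged
        return True

    def pfcount(self, *sources):
        union = set()
        for src in sources:
            union |= self._data.get(src, set())
        return len(union)

    def delete(self, *names):
        return sum(1 for n in names if self._data.pop(n, None) is not None)


def estimate_intersection(client, key_a, key_b, tmp_key="tmp:union"):
    """Estimate |A ∩ B| as |A| + |B| - |A ∪ B| (inclusion-exclusion)."""
    client.delete(tmp_key)                 # start from an empty destination
    client.pfmerge(tmp_key, key_a, key_b)  # tmp_key now holds the union
    count_a = client.pfcount(key_a)
    count_b = client.pfcount(key_b)
    count_union = client.pfcount(tmp_key)
    # With real HLLs the estimate can dip below zero, so clamp it.
    return max(0, count_a + count_b - count_union)


client = FakeHLLClient()
client.pfadd("users:campaign_a", *range(0, 1000))
client.pfadd("users:campaign_b", *range(600, 1500))
overlap = estimate_intersection(client, "users:campaign_a", "users:campaign_b")
print(overlap)  # 400 with the exact stand-in; a real HLL returns an approximation
```

With a real client the same function should work unchanged, but the errors of the three counts add up, so the estimate gets less reliable when the intersection is small relative to the union.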

Did you ever consider using that?

2 comments

Thanks! We have considered using HLL, and it's a pretty cool algorithm.

For us, however, it's important to get the set members at the end of the day. Amplitude differs from other analytics products in that we put a lot of emphasis on the actual users that correspond to a data point on a graph -- one of our key features, Microscope, is the ability to view those users, see more context around the events they are performing, and potentially create a dynamic cohort out of them. As such, approximations that don't allow us to get the set members don't quite satisfy our use case.

Sorry, I wanted to say that I am actually not familiar with your product and was not aware of the feature in which you can create audiences for running ad campaigns. From the article I thought the sets were used mostly for real time analytics and that is why I started to talk about HLL.

If you do need the actual set members in real time then of course you can't use HLL :)

HyperLogLog is approximate. You also can't do set complement. Also can't get the ids. But other than that it's great!

I am not gonna suggest that one solution fits all. Maybe you have different requirements but in our case (adtech) offering near real time reporting with a standard error rate of 0.81% is very good.
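
For context on where the 0.81% comes from, here is a small sketch. It assumes the usual HyperLogLog error formula of roughly 1.04 / sqrt(m) for m registers, and the 16384 registers that Redis uses per key. The function name and the 100 million example are only illustrative.

```python
import math

REGISTERS = 16384  # 2**14 registers per Redis HyperLogLog key


def hll_standard_error(registers):
    """Approximate relative standard error of a HyperLogLog with `registers` registers."""
    return 1.04 / math.sqrt(registers)


err = hll_standard_error(REGISTERS)
print(f"{err:.2%}")  # 0.81%

# Illustrative only: one standard error on a count of 100 million members.
typical_abs_error = err * 100_000_000
print(round(typical_abs_error))
```

So on a set of about 100 million members, one standard error is on the order of 800 thousand, which is fine for reporting but is also why it cannot replace exact sets.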