Hacker News new | ask | show | jobs
by vinnyglennon 5011 days ago
Vinny Glennon, One of the founders here. Thanks very much for the up votes. The Chrome extension does not work in private browsing. I have a set of porn sites(1.7 million stored in redis) that I check if incoming links are a member of. You can selectively block sites ( https://www.seenbefore.com/blacklist_items).
2 comments

Where did you get your list from, and can you share it?
For research purposes.
For science.
Do you store the entire set of 1.7 million entries in redis? Or is redis an index to data stored elsewhere, in a relational DB perhaps?

I was under the impression that redis wouldn't be all that useful to store a lot of data. Would be great if something as quick as redis could work with large data sets.

Storing a list of 1.7 million strings for us takes 70mb stored in memory. Testing for membership is an O(1) op. Very happy with it. We use mongo as a dumb data store as well as a bunch of other infrastructure tools, like http://circleci.com we could have only dreamt of years ago.
> Testing for membership is an O(1) op

Curious how. O(1) an array index lookup, not a string lookup, I thought.

Again just curious, but I'd still like to know how. Someone there asks the mod how it could be O(1), the mod replies it's a "hash table lookup". But Wikipedia at http://en.wikipedia.org/wiki/Big_O_notation suggests that such lookup is no faster than O(log log n). I think the redis info is incorrect.

O(1) implies that the location of the member in the list is already known, with no search required. I don't see how that could be the case when it's a key lookup. The key could be anywhere in the list, even if the list is sorted. They key would have to be searched for, it seems.

O(1) means constant time. Redis sets are interesting, because there are a few possible implementations under the hood, but the typical case winds up implementing it as a hash table. Hash tables have constant time lookup.

Think of it this way (this isn't literally what happens, but it's close).

1) Take the url you're looking for. Run it through a hash function. This takes an (amortized) constant amount of time.

2) Now you have an index to check. (the return value from the hash function). So index into your actual table, and check to see what's stored there. If there's a value stored there, then the url is a member of the set. This also takes a constant amount of time.

Does this help?

Why not use a bloom filter?
The main benefit of Bloom filters is that they can be made small. Given that his database takes only 70MB or so and he's not trying to ship this to devices that might have much in terms of space limitations, there would appear to be little point.
Eh, true, I guess redis is sufficiently awesome.
Maybe because of this fact (according to Wikipedia)?: "The more elements that are added to the set, the larger the probability of false positives."
That depends on its size, though. You can make it larger and get fewer false positives.