| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vinnyglennon 5058 days ago
	Vinny Glennon, One of the founders here. Thanks very much for the up votes. The Chrome extension does not work in private browsing. I have a set of porn sites(1.7 million stored in redis) that I check if incoming links are a member of. You can selectively block sites ( https://www.seenbefore.com/blacklist_items).

2 comments

andrewthornton 5058 days ago

Where did you get your list from, and can you share it?

link

user24 5058 days ago

For research purposes.

link

jparishy 5057 days ago

For science.

link

krassif 5054 days ago

Perhaps http://www.shallalist.de/ ?

link

gingerjoos 5058 days ago

Do you store the entire set of 1.7 million entries in redis? Or is redis an index to data stored elsewhere, in a relational DB perhaps?

I was under the impression that redis wouldn't be all that useful to store a lot of data. Would be great if something as quick as redis could work with large data sets.

link

vinnyglennon 5058 days ago

Storing a list of 1.7 million strings for us takes 70mb stored in memory. Testing for membership is an O(1) op. Very happy with it. We use mongo as a dumb data store as well as a bunch of other infrastructure tools, like http://circleci.com we could have only dreamt of years ago.

link

genwin 5057 days ago

> Testing for membership is an O(1) op

Curious how. O(1) an array index lookup, not a string lookup, I thought.

link

milkshakes 5057 days ago

http://redis.io/commands/sismember

link

genwin 5057 days ago

Again just curious, but I'd still like to know how. Someone there asks the mod how it could be O(1), the mod replies it's a "hash table lookup". But Wikipedia at http://en.wikipedia.org/wiki/Big_O_notation suggests that such lookup is no faster than O(log log n). I think the redis info is incorrect.

O(1) implies that the location of the member in the list is already known, with no search required. I don't see how that could be the case when it's a key lookup. The key could be anywhere in the list, even if the list is sorted. They key would have to be searched for, it seems.

link

milkshakes 5057 days ago

O(1) means constant time. Redis sets are interesting, because there are a few possible implementations under the hood, but the typical case winds up implementing it as a hash table. Hash tables have constant time lookup.

Think of it this way (this isn't literally what happens, but it's close).

1) Take the url you're looking for. Run it through a hash function. This takes an (amortized) constant amount of time.

2) Now you have an index to check. (the return value from the hash function). So index into your actual table, and check to see what's stored there. If there's a value stored there, then the url is a member of the set. This also takes a constant amount of time.

Does this help?

link

StavrosK 5057 days ago

Why not use a bloom filter?

link

vidarh 5057 days ago

The main benefit of Bloom filters is that they can be made small. Given that his database takes only 70MB or so and he's not trying to ship this to devices that might have much in terms of space limitations, there would appear to be little point.

link

StavrosK 5057 days ago

Eh, true, I guess redis is sufficiently awesome.

link

genwin 5057 days ago

Maybe because of this fact (according to Wikipedia)?: "The more elements that are added to the set, the larger the probability of false positives."

link

StavrosK 5057 days ago

That depends on its size, though. You can make it larger and get fewer false positives.

link