


	by saalweachter 2571 days ago
	A Bloom filter is just way overkill. If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three-line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how many queries were seen exactly once.
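
A minimal single-machine sketch of that counting job, assuming the log is simply an iterable of query strings. The function names (`map_phase`, `reduce_phase`, `count_singletons`) are hypothetical, and the in-memory sort stands in for the shuffle a real MapReduce framework would perform:

```python
from itertools import groupby


def map_phase(queries):
    # Map: emit (query, 1) for every logged query string.
    for query in queries:
        yield query, 1


def reduce_phase(pairs):
    # Shuffle + reduce: sort so equal keys are adjacent, then sum per key.
    for query, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield query, sum(count for _, count in group)


def count_singletons(queries):
    # Build the frequency table, then count queries seen exactly once.
    freq = dict(reduce_phase(map_phase(queries)))
    singletons = sum(1 for n in freq.values() if n == 1)
    return freq, singletons


log = ["cats", "dogs", "cats", "bloom filter", "dogs", "cats", "hyperloglog"]
freq, singletons = count_singletons(log)
print(freq)        # per-query frequency table
print(singletons)  # number of queries seen exactly once
```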


sloppycee 2571 days ago

uh, is this sarcasm?

A Bloom filter is the most appropriate data structure for this use case. How is it overkill when it uses less space and is faster to query?
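
For comparison, a rough sketch of the Bloom filter approach: keep a bit array of previously seen queries and count a query as new when the filter has not seen it. The class, the `count_new` helper, and the sizing parameters are illustrative assumptions, not anything specified in the thread. False positives mean the count of new queries can be slightly underestimated.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def count_new(queries, seen):
    # Count queries the filter has (probably) not seen, adding each as we go.
    new = 0
    for query in queries:
        if query not in seen:
            new += 1
            seen.add(query)
    return new


seen = BloomFilter(num_bits=1 << 16, num_hashes=4)  # sizes chosen arbitrarily
print(count_new(["cats", "dogs", "cats"], seen))
```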


Strilanc 2570 days ago

Actually, the Bloom filter was just an approachable example. There are much more clever and space-efficient solutions to this problem, such as HyperLogLog [1] (speculating purely based on the numbers in that article, it looks like a few megabytes of space would be far more than sufficient). See the Wikipedia page on the "Count-distinct problem" [2].

[1] https://en.wikipedia.org/wiki/HyperLogLog
[2] https://en.wikipedia.org/wiki/Count-distinct_problem
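
A simplified HyperLogLog sketch to make the idea concrete. It assumes a 64-bit hash taken from SHA-256 and a precision of `p=12` (4096 registers); both are illustrative choices, and the class is not a tuned or production implementation. It estimates the number of distinct queries only, so it answers the count-distinct question rather than tracking individual queries.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        width = 64 - self.p
        idx = h >> width                             # top p bits pick a register
        rest = h & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        # Bias-correction constant; this form is valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"query-{i}")
print(round(hll.estimate()))  # approximate count of distinct queries
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, while the sketch itself needs only a few kilobytes.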


saalweachter 2568 days ago

My initial approach was also technically wrong; it tells you the fraction of queries which happen once.

To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just replace the count): the first date the query was seen. Once you have the first-seen date for each query, sum up the number of queries first seen on each date and compare it to the total traffic for that date.
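
The first-seen aggregation described above can be sketched as follows, assuming the log is an iterable of `(date, query)` pairs with ISO-format date strings so that string comparison matches chronological order. The function name and the sample dates are hypothetical:

```python
from collections import Counter


def new_query_fraction_by_date(log):
    first_seen = {}      # query -> earliest date it appears
    traffic = Counter()  # date -> total number of queries issued
    for date, query in log:
        traffic[date] += 1
        if query not in first_seen or date < first_seen[query]:
            first_seen[query] = date

    # Number of distinct queries whose first appearance falls on each date.
    new_per_date = Counter(first_seen.values())
    return {date: new_per_date[date] / traffic[date] for date in traffic}


log = [
    ("2013-01-01", "cats"),
    ("2013-01-01", "dogs"),
    ("2013-01-02", "cats"),
    ("2013-01-02", "birds"),
    ("2013-01-02", "dogs"),
    ("2013-01-02", "fish"),
]
print(new_query_fraction_by_date(log))
```

In this toy log every query on the first day is new, and half of the second day's traffic consists of queries not seen before.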

You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.
