| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by lucasschm 3314 days ago
	You are correct, but HyperLogLog has many buckets counting the longest run of zeros in order to avoid the problem of outliers. I recently studied these probabilistic algorithms and did a notebook with code and plots to show their performance: https://github.com/lucasschmidtc/Probabilistic-Algorithms/bl...

2 comments

snowcrshd 3314 days ago

Thanks for sharing that!

Just skimmed through it and seems pretty interesting. I'll read it more in depth later.

link

lucasschm 3314 days ago

No problem. If there are mistakes or a segment is not clear, let me know

link

meetapoorvgupta 3314 days ago

Thanks for the write up, Lucas. It was very intuitive and I learnt a lot.

I noticed that you used 5000 buckets to store the frequency of 7000 non-unique words in the section on 'Counting Bloom Filters'. How is that better than using 7000 buckets and a uniformly distributed hash function, which would maintain frequencies perfectly? We would be using fewer buckets by an order of magnitude in a real-world implementation to save memory.

link

lucasschm 3313 days ago

Yeah, I should have given more thought to that number. Updated the example for N=300. Thanks

link

anhldbk 3314 days ago

Thanks for sharing guy! Interesting repo.

link