An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
from random import random
def estimate_uniques(iterable, window_size=100):
p = 1
seen = set()
for i in iterable:
if i not in seen:
seen.add(i)
if random() > p:
seen.remove(i)
if len(seen) >= window_size:
seen = {s for s in seen if random() < 0.5}
p /= 2
return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters, so it would be regrettable if logic teachers actually did use ⊥ to mean false because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.