|
|
|
|
|
by gwillen
753 days ago
|
|
Merging like that doesn't work -- it will tend to overestimate the number of distinct elements. This is fairly easy to see, if you consider a stream with some N distinct elements, with the same elements in both the first and second halves of the stream. Then, supposing that p is 0.5, the first instance will result in a set with about N/2 of the elements, and the second instance will also. But they won't be the same set; on average their overlap will be about N/4. So when you combine them, you will have about 3N/4 elements in the resulting set, but with p still 0.5, so you will estimate 3N/2 instead of N for the final answer. I have a thought about how to fix this, but the error bounds end up very large, so I don't know that it's viable. |
|