Hacker News new | ask | show | jobs
by melq 759 days ago
Estimating the amount of unique elements in a set and counting the amount of unique elements in a set are very different things. Cool method, bad headline.
3 comments

They’re not very different things; the terms are used interchangeably in most contexts because in the real world all counting methods have some nonzero error rate.

We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

I kind of think the mythology of the ‘countless stones’ (https://en.wikipedia.org/wiki/Countless_stones) is a sort of folk-reminder that you can never be too certain that you counted something right. Even something as big and solid and static as a standing stone.

The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

> the terms are used interchangeably in most contexts

Counting and estimating are not used interchangeably in most contexts.

> because in the real world all counting methods have some nonzero error rate.

The possibility that the counting process may be defective does not make it an estimation.

> We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

We talk about counting votes in elections because votes are counted. The fact that the process isn't perfect is a defect; this does not make it estimation.

> That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

No. Exit polling is estimation. Vote counting is counting. Vote recounting is also counting, and does not necessarily impose a tighter error bound, nor necessarily derive a different number.

> The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

So like, computers? Regardless, this is wrong. Estimating something and counting it are not the same thing. Estimation has uncertainty, counting may have error.

This is like saying addition estimates a sum because you might get it wrong. It's just not true.

So, IEEE floating point doesn’t support ‘addition’ then.
IEEE 754 defines an exact binary result for the addition of any two floats.

That this bit-identical result is not the same operation as addition of real numbers is irrelevant, because floats aren't reals.

f1 + f2 is not an estimation. Even treating it as an approximation will get you into trouble. It's not that either, it's a floating-point result, and algorithms making heavy use of floating point had better understand precisely what f1 + f2 is going to give you if they want to obtain maximum precision and accuracy.

Cool, so next time I have numbers that aren't reals to perform math on, I can use floats.
Or if you have numbers that aren't integers to perform math on, you can use integers.

It's not a new problem, and it isn't specific to floats. Computers do discrete math. Always have, always will.

Come on. There is a fundamental difference between trying to get an exactly answer and not trying to get an exactly correct answer.
It’s not a fundamental difference, it’s a fundamental constraint.

There are circumstances - and in real life those circumstances are very common - where you must accept that getting an exactly correct answer is not realistic. Yet nonetheless you want to ‘count’ things anyway.

We still call procedures for counting things under those circumstances ‘counting’.

The constraints on this problem (insufficient memory to remember all the unique items you encounter) are one such situation where even computerized counting isn’t going to be exact.

I agree with you, but we are talking theory here. The algorithm doesn't count, it estimates.

You can make an algorithm that counts, you can make an algorithm that estimates, this is the second.

Estimation is counting with error bars.

Frankly, most of what you consider counting in your comment needs error bars - ask anyone who operated an all-cash cash-register how frequently end-of-day reconciliation didn't match the actual cash in the drawer (to the nearest dollar.)

The following is a list from my personal experience - of presumably precisely countable things that didn't turn out to be the case: the number of computers owned by an fairly large regional business, the number of (virtual) servers operated by a moderately sized team, the number of batteries sold in a financial year by a battery company.

Counting is a subset of estimation, not a synonym.

If I estimated the number of quarters in a stack by weighing them, that would be different from estimating the number of quaters in a stack by counting them. Both methods of estimation have error bars.

The list you provide is of categories that don't have clear definitions. If you have a sufficiently clear definition for a category given your population, it has a precise count (though your counting methodologies will still be estimates.) If your definition is too fuzzy, then you don't actually have a countable set.

I think I get your point completely, yet I'm not getting through.

Would you agree that 1+1=2? Or that pi is 3.14159...? These are mathematical truths, but quickly crumble in the real world. One apple plus one apple doesn't just equate to double the apple, no two apples are ever the same to begin with, there are no real perfect circles out there either, there is still value to those mathematical truths in that they make it evident that they are perfectly precise and that it is real world interaction which may bring error into the table.

Counting and estimation are different by definition. One is a full enumeration, the latter an extrapolation from sampled data. In both cases 'accuracy' is a factor. Even if we are counting the number of stars, it is still a difference of technique compared to estimating the number if stars.

I could try to count fibers in muscle or grains of sand in the beach, chances are accuracy would be low. One can be smart about technique for more accurate counts, eg: get 10M sand counters and give them each 1kg of sand which they then count the grains with tweezer and microscope. That is counting. At the same time, we could find an average count of grains in 1kg sand from a random 100 of our counters, and then estimate what an expected total would be. The estimate can be used to confirm the accuracy of the counts.

They are not really as far apart a you think. At small numbers, yes get are distinct. At large enough numbers, they in all practicality the same thing. E.g what’s the population of the US
And by that definition this is a counting algorithm.
True - for (relatively) small numbers. For large (huge) numbers estimation is usually considered to be equivalent to counting, and the result is sometimes represented using the "scientific" notation (i.e. "floating-point") rather than as an integer. For example, the mole is an integer whose value is only known approximately (and no one cares about the exact value anyway).
As of May 2019, the mole has an exact value, and Carbon-12's molar mass is the empirically-determined value.
This doesn't justify estimation to be equivalent to counting even if some mathematicians consider them to be the same. Floating points are for estimation. Integers are for counting. The two are not the same, not even for large numbers.
"Equivalent" and "the same" are sometimes equivalent. (Or the same.)

It depends on what the meaning of the word 'is' is.

https://libquotes.com/bill-clinton/quote/lby0h7o

It's an approximation, not an estimation.
Actually, my understanding is that it is an estimation because in the given context we don't know or cannot compute the true answer due to some kind of constraint (here memory or the size of |X|). An approximation is when we use a simplified or rounded version of an exact number that we actually know.
Wikipedia is on your side:

"In mathematics, approximation describes the process of finding estimates in the form of upper or lower bounds for a quantity that cannot readily be evaluated precisely"

This process doesn't use upper and lower bounds.

However, it still seems more like approximation than estimation to me because of this:

“Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.

It seems that in estimation the answer should be unknowable without additional information, whereas in this case it's just a matter of resolution or granularity because of the memory size.

Anyhoo ...

EDIT: also the paper says "estimate" and the article says both "approximate" and "estimate" at different times so it seems everyone except me thinks it's either an estimation or that estimation and approximation are interchangeable.

Still very different things, no?
It's the same thing at different degrees of accuracy. The goal is the same.
Still, counting things and counting unique things are two different procedures.
For someone who's pretty well-versed in English, but not a math-oriented computer scientist, this seems like a distinction without a difference. Please remedy my ignorance.
My GP was wrong, but the words are different.

Eatimation is a procedure the generates an estimate, which is a kind of approximation, while approximation is a result value. They are different "types", as a computer scientist would say. An approximation is any value that is justifiably considered to be nearly exact. ("prox" means "near". See also "proximate" and "proxy".)

Estimation is one way to generate an approximation. An estimate is a subtype of an approximation. There are non-estimation ways to generate an approximation. For example, take an exact value and round it to the nearest multiple of 100. That generates an approximation, but does not use estimation.

I’m not sure the linguistic differences here are as cut and dried as you would like them to be. Estimate and approximate are both verbs, so you can derive nouns from them both for the process of doing the thing, and for the thing that results from such a process.

Estimation is the process of estimating. It produces an estimate.

Approximation is the process of approximating. It produces an approximation.

You can also derive adjectives from the verbs as well.

An estimate is an estimated value.

An approximation is an approximate value.

But you’re right that the ‘approximate’ terms make claims about the result - that it is in some way near to the correct value - while the ‘estimate’ derived terms all make a claim about the process that produced the result (ie that it was based on data that is known to be incomplete, uncertain, or approximate)

The authors of the article disagree with you.
The authors of the paper disagree with me, the authors of the article don't (they use both approximate and estimate, but the paper does say estimate).