by m12k 1833 days ago
In my statistics class we were taught a technique for estimating the number of flies in a barn. You capture a fixed-size sample of flies, e.g. 100, give each a little white dot on the back, then release them all back into the barn. Leave them to mingle for a while, then capture another fixed number of flies, e.g. 100. If 5 of them have a white dot, then you estimate that the 100 flies you originally captured and marked make up 5% of the total population, meaning your estimate of the total population is 2000 flies.
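For anyone who wants to play with the numbers, here is a minimal sketch of that estimate (usually called the Lincoln-Petersen estimator). The function name and arguments are my own, not from the class; it assumes every fly is equally likely to be caught in the second round.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Estimate population size from one mark-and-recapture round.

    marked:     flies marked and released in the first round
    caught:     flies caught in the second round
    recaptured: how many of the second-round flies carry a mark
    """
    if recaptured == 0:
        # No marked flies seen again: the simple ratio has no finite answer.
        return float("inf")
    # The marked fraction of the second sample (recaptured / caught) is taken
    # as the marked fraction of the whole barn (marked / population).
    return marked * caught / recaptured


print(lincoln_petersen(100, 100, 5))  # 2000.0
```

With zero recaptures the function returns infinity, which is the ratio's way of saying the sample was too small to tell you anything.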

I found two flies in my house, marked them, then later found two more flies without marks. I think this means the population of flies in my house is infinite.
My condolences, that doesn't sound very nice. Unless you're a frog, in which case, congratulations!
Dear m12k,

I have a valid reason to outlaw humour.

Because of your comment, 3 things have happened.

1) I snorted, causing my hot tea to use my nostrils as an emergency exit.

2) My nose became a tea tap.

3) My screen and keyboard, bless them, are covered with tea. My shame in cleaning it up is indescribable.

Bravo.

You want to consider, based on a distribution of potential fly populations, what the odds are that sampling two of them will fail to capture two specific ones. For example, you can state with certainty that the population is above 3.

Closely related (though not quite the identical problem): https://www.johndcook.com/blog/2010/03/30/statistical-rule-o...

Applying that directly, we would estimate that the odds of a fly in your house being marked are less than 150%, which tells us that you should be taking larger samples.

(With only two marked flies, this estimate will always be less informative than the fact that you sampled n flies -- it will tell you that the population is probably greater than 2/3 of n, while the sampling procedure tells you that the population is necessarily at least n+2. But as you mark more flies, the estimate will be more informative.)

Mostly because I thought it was fun, a worked example where you mark 100 different flies and then catch 100 unmarked flies (releasing each fly individually immediately after catching it; maybe you caught the same fly 100 times):

By the rule of three, we estimate the probability that a fly in your house is marked at p < 3/100.

We can also model the probability that a fly in your house is marked as 100/n, where n is the number of flies in your house.

Then 100/n < 3/100 and n > 100*100/3 = 3,333. There are probably more than 3333 flies in your house.
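A small sketch of the worked example above, in case anyone wants to try other sample sizes. The function names are made up for illustration. It assumes each catch is an independent draw (catch, check, release), and it compares the rule-of-three shortcut with the exact 95% bound it approximates.

```python
def rule_of_three_upper(n_trials):
    """Approximate 95% upper bound on p after n_trials with zero hits."""
    return 3 / n_trials


def exact_upper(n_trials, confidence=0.95):
    """Exact bound: the p at which seeing zero hits in n_trials has
    probability 1 - confidence, i.e. (1 - p) ** n_trials = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n_trials)


def min_population(marked, p_upper):
    """If p = marked / n and p < p_upper, then n > marked / p_upper."""
    return marked / p_upper


p_rule = rule_of_three_upper(100)
p_exact = exact_upper(100)
print(p_rule, min_population(100, p_rule))    # bound from the shortcut
print(p_exact, min_population(100, p_exact))  # slightly tighter exact bound
```

The exact bound comes out a little below 3/100, so the population floor it implies is a little above 3,333.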

This reminds me of the time I found two flies in my house despite the windows being closed. I vacuumed them up (more fun that way) and the next day there were two identical looking flies in the house. Did they get out of the vacuum cleaner? I vacuumed them up again.

The next day, same story. The flies were now appearing throughout the day. They all looked the same. Where the hell were they coming from? The windows were still closed. I kept vacuuming.

Eventually things were starting to get very irritating. I hunted for an entry point for days without finding anything and the flies just kept on appearing. I was pretty good at vacuuming flies by this point.

By day six I spotted a group of them hanging out near my Kentia palm. Aha! A fly had laid eggs in the soil of the plant. I had no idea that it was a suitable food source for larvae. Needless to say I quickly filled it with gravel and I guess that's the story of how I became a qualified Fly Detective.

Another fun thing you can do with flies, and also bees and wasps, apart from vacuuming them up, is put them on a leash. But first you have to freeze them.

Catch one in a cup or plastic bag and stick it in the freezer for about 10 minutes. When you take it out it will look dead, but it's not (unless you leave it in too long). Being careful not to rip its wings off, tie a small string or fishing line to one of its legs.

In a few minutes it will thaw and start to walk around, and then start to fly. You can now walk it around the park like you were carrying a balloon.

That’s animal abuse.
Where would you say it falls on the spectrum of animal abuse in relation to going fishing, fly swatting, and walking the dog? Those are all activities I'm personally ok with.
*insect abuse

I think repeatedly slapping them with a pretty soft plastic contraption until they stop moving is even harsher though.

Or slapping them outta the air, trying to electrocute them but having too little power on the shitty device so only their wings get burned.

Or vacuuming them up

Or flushing them down the drain.

Honestly, whatever people usually do to them is way, way crueller.

> I think repeatedly slapping them with a pretty soft plastic contraption until they stop moving is even harsher

It's soft to you, not to the fly. You don't slap a fly "until it stops moving". When you swat a fly, it explodes.

This was a far more engaging story than it deserves to be.
That's scary. I used to be a hardcore insectaphobe, but I got over it over the last couple years. Now I don't care that much when I see insects in my house. But it is within reason! I wouldn't want bugs reproducing in my house, that sounds like a slippery slope!
> I wouldn't want bugs reproducing in my house, that sounds like a slippery slope!

Don’t Google dust mites.

Or Face Mites.
I admire your perception. I'm personally unable to distinguish different fly specimens, they all look the same to me.
It's only the size and age of them that makes one fly distinguishable from another. When they're all being born at the same time there's absolutely nothing that makes them distinct, and that's your clue that eggs have been laid somewhere in the vicinity.
> It's only the size and age of them that makes one fly distinguishable from another.

Because of their rarity, I find that the ones that wear tiny top hats are easily distinguishable too.

Reminds me of how the Allies estimated the production capacity of German tanks during WW2: https://medium.com/dataseries/how-data-science-gave-the-alli...
If you're interested, I referenced this and used the technique to estimate the quantity of STS SRBs recently:

https://space.stackexchange.com/questions/9261/how-many-soli...

Oh, no.... The date on that post is 2015... "recently"... I'm old...

My intuitive answer to that problem was to say that if we assume the captured serial numbers are randomly distributed, and the numbering starts at 1, then they will have the same average as all the numbers, so the estimate should be the average of captured serial numbers times 2. Which gives a result close to the formula used in this article, but not the same. I'm not sure where the flaw is.
If there are 100 tanks, and you get 1, 2, 5, and 99, your method would give 54 tanks ((1 + 2 + 5 + 99)/4 * 2), which is obviously wrong.

Your error is in stating "if we assume the captured serial numbers are randomly distributed" - you're assuming they're -uniformly- distributed. Randomly distributed != uniformly distributed.

Their method would give you about 123 as a guess (m + m/k - 1 = 99 + 99/4 - 1 = 122.75, assuming the article uses the usual formula). It's including the known info (i.e., adding "m") to take into account the fact that they're not necessarily evenly distributed.

On that note, if you continued to get tanks at low numbers (3, 4, 6, etc), averaging gets -less- accurate, because that 99 becomes more and more of an outlier. Their method gets MORE accurate, again, because they're taking advantage of all data that is known (we know it goes at least to 99), and averaging doesn't. The new low numbers we've added mean that there are less likely to be many tanks, and the formula in the link takes that into account with m/k.

Both methods will be accurate if you have 100% of the data, but taking twice the average ignores known data, so the sparser the data the less likely it is to be correct.
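To make the comparison concrete, here is a rough sketch of both estimators on the 1, 2, 5, 99 example. The function names are mine, and I'm assuming the article's formula is the textbook m + m/k - 1 (m = largest serial seen, k = number of serials seen).

```python
def twice_mean(serials):
    """Estimate the total as twice the average serial number."""
    return 2 * sum(serials) / len(serials)


def max_plus_gap(serials):
    """Textbook estimate: largest serial plus the average gap, minus one."""
    m = max(serials)   # largest serial number observed
    k = len(serials)   # number of serials observed
    return m + m / k - 1


sample = [1, 2, 5, 99]
print(twice_mean(sample))    # 53.5, below the serial 99 that was actually seen
print(max_plus_gap(sample))  # 122.75

# More low serials pull twice_mean down further, while max_plus_gap
# can never drop below the largest serial observed.
bigger = sample + [3, 4, 6]
print(twice_mean(bigger), max_plus_gap(bigger))
```

The second estimator is anchored to the maximum, so it can't contradict the data the way twice the average does.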

Hmmm, on the other hand, suppose you first find a tank with the serial number 1234.

Then the next 50 tanks you find are all from the range [1, 100].

Is it more reasonable to assume that there are around 1258 tanks, or that there are probably closer to 100 tanks, and that first one with the very large serial number was not a sequentially numbered tank?

Certainly!

But, from the article's initial proposition - "You do know that the Germans have a sequential numbering system (1, 2, …, n)" and in giving historical context "On investigation, it became clear that the serial numbers were sequential, without gaps."

So, yes, without that being a prior, of course it's more likely that that outlier is a strange one off, and you'd do better to exclude it from your data set (and/or continue to investigate, because it's NOT at all clear that the serial numbers are sequential yet).

But that context and ordering matter. Assume just the opposite series of events: you start by finding 50 tanks with serial numbers in [1, 100]. Then three or four months go by in which you don't get any tank serials sent to you. And then you get 1234. 1258 tanks seems really reasonable at that point (and, in fact, would fit the reality; the Germans were producing ~256 tanks per month per the article).

Great read!

In comparison, I have to wonder why the "intelligence estimates" were such severe overestimates.

My intuition says counter-intelligence. The Allies were using intercepted communications, visual confirmation, captured sources, etc.

You can send fake reports if you think the other side is listening. You can move or mock material if you think the other side is watching. You can feed false information if you are captured.

And this is why you may be better off solving the “serial number problem” not by switching to entirely random serial numbers but by switching to a scheme that implies the false data you want the enemy to find.
Thank you! Great read, indeed!
paywalled for me.
This catch and release pattern is quite common in many wildlife surveys.

For example, when estimating tiger populations, instead of painting a white dot, rangers set up camera traps to take photos. Then they can use stripe patterns as a signature for subsequent re-appearances. Quite an interesting intersection for image AI with statistical counting methods.

Are there adjustments that need to be made for non-random sampling in that case? I'd imagine that with any territorial animal, a stationary camera is most likely to see the same animal multiple times.
True. I’m no expert in this area, but there are a lot more factors too - including territorial range and even different camera locations.
Fun idea, but presumably there is a big bias in that some flies will be easier to catch for various reasons. I suppose lots of population studies on birds, cetaceans or whatever have this bias, since you can't just slap a quadrat down and catch everything. In fact with these more advanced animals, perhaps they're harder to observe a second time since they've learned to be wary of humans.
Right, if you're measuring long enough that predators eating some of the flies could be an issue, then you'd want to mark them with something only visible under a blacklight, or similarly "neutral", so it doesn't introduce a bias.
If predators or age (or moving out or into the measured area) are a factor, they are already biased and keeping your method neutral will still give you very wrong results.
Unless the predator can see in the UV spectrum, like bats
My first ever job (in high school) was manually counting mosquito larvae in water samples for our local department of agriculture. IIRC the aforementioned technique doesn't really work for that, instead they collected samples regularly for several months at the same spots each year and used that for their forecasts.
I wonder how that works. Why can you use the 5 flies you found to divide the total sample size?

It's kind of giving you an invariant number:

* 100 flies -> 5 flies (5%), 100 / 0,05 = 2000
* 80 flies -> 4 flies (5%), 80 / 0,04 (not 0,05) = 2000

But if you used the percent instead (which seems to be what you hinted at) then the results vary:

* 100 flies -> 5 flies (5%), 100 / 0,05 = 2000
* 80 flies -> 4 flies (5%), 80 / 0,05 = 1600

What's the principle, does it have a name I could read more about on Mathworld / Wiki?

You can read more about the technique here: https://en.wikipedia.org/wiki/Mark_and_recapture
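A quick simulation may help with the "why" here. It's only a sketch: the barn size of 2000, the fixed seed, and the function name are all assumptions for illustration. The point is that the marked fraction in the second sample hovers around marked / population, so you divide the number of marked flies (100) by that fraction, not the size of the second sample.

```python
import random


def simulate_recaptures(population, marked, caught, trials, seed=0):
    """Return the number of marked flies in each of `trials` second samples.

    Flies are numbered 0..population-1; flies 0..marked-1 carry the dot.
    Each second sample is drawn without replacement, every fly equally likely.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        sample = rng.sample(range(population), caught)
        counts.append(sum(1 for fly in sample if fly < marked))
    return counts


marked, caught = 100, 80
counts = simulate_recaptures(2000, marked, caught, trials=2000)
avg_fraction = sum(counts) / len(counts) / caught

print(avg_fraction)           # close to 100 / 2000 = 0.05
print(marked / avg_fraction)  # close to the true barn size of 2000
```

Changing `caught` from 80 to 100 changes how noisy a single sample is, but the fraction still centres on 5%, which is why the estimate stays near 2000 either way.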
But how do you catch flies!?!?!
Probably using a net and some sort of mild tranquilizer.
If they’re randomly distributed, that makes sense.