

	by 0x1ceb00da 912 days ago
	This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, count the number of tagged fish in this batch)

9 comments

pants2 912 days ago

That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen Estimator.
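For anyone who wants to check the arithmetic, here is a minimal sketch of the Lincoln-Petersen estimate applied to the reviewer numbers above. The function and variable names are made up for illustration, and it assumes the two reviewers work independently.

```python
def lincoln_petersen(first_count: int, second_count: int, overlap: int) -> float:
    """Estimate the total population from two independent samples.

    first_count:  items found in the first sample (reviewer A's bugs)
    second_count: items found in the second sample (reviewer B's bugs)
    overlap:      items that appear in both samples
    """
    if overlap <= 0:
        raise ValueError("no overlap between samples; the estimate is undefined")
    return first_count * second_count / overlap


found_a, found_b, found_both = 4, 5, 2
estimated_total = lincoln_petersen(found_a, found_b, found_both)
distinct_found = found_a + found_b - found_both   # bugs seen by at least one reviewer
estimated_missed = estimated_total - distinct_found

print(estimated_total, distinct_found, estimated_missed)  # 10.0 7 3.0
```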

cpeterso 912 days ago

A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)

https://en.m.wikipedia.org/wiki/Bebugging
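A rough sketch of how the seeding numbers might be turned into an estimate. This is an assumption about the usual calculation, not something stated in the comment: it treats the fraction of seeded bugs found as the detection rate for real bugs too. All names and numbers are hypothetical.

```python
def estimate_real_bugs(seeded_total: int, seeded_found: int, real_found: int) -> float:
    """Estimate total real bugs, assuming real and seeded bugs are equally easy to find."""
    if seeded_found <= 0:
        raise ValueError("no seeded bugs were found; the estimate is undefined")
    detection_rate = seeded_found / seeded_total
    return real_found / detection_rate


# Hypothetical run: 20 bugs seeded, testing found 15 of them plus 30 real bugs.
estimated_real_total = estimate_real_bugs(seeded_total=20, seeded_found=15, real_found=30)
estimated_remaining = estimated_real_total - 30

print(estimated_real_total, estimated_remaining)  # 40.0 10.0
```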

mewpmewp2 912 days ago

But this implies that all bugs are of equal likelihood of being found which I would highly doubt, no?

pants2 912 days ago

Yes, it's obviously not a perfect estimate, but can be directionally helpful.

You could bucket bugs into categories by severity or type and that might improve the estimate, as well.

rightbyte 912 days ago

Oh this is a really interesting concept.

I guess it underestimates the number of hard-to-find bugs though, since it assumes every bug has the same likelihood of being found.
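A small simulation can illustrate the concern. This is a toy sketch with invented detection probabilities: two reviewers independently inspect 100 bugs, once where every bug is equally findable and once where half are easy and half are hard.

```python
import random


def simulate_review(detect_probs, rng):
    """Return the set of bug indices that one reviewer finds."""
    return {i for i, p in enumerate(detect_probs) if rng.random() < p}


def mean_estimate(detect_probs, trials=2000, seed=0):
    """Average the Lincoln-Petersen estimate over many simulated review pairs."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        found_a = simulate_review(detect_probs, rng)
        found_b = simulate_review(detect_probs, rng)
        overlap = len(found_a & found_b)
        if overlap:  # skip the rare trial where the estimate is undefined
            estimates.append(len(found_a) * len(found_b) / overlap)
    return sum(estimates) / len(estimates)


uniform = [0.45] * 100              # every bug equally likely to be found
mixed = [0.8] * 50 + [0.1] * 50     # 50 easy bugs, 50 hard bugs (same average rate)

print("uniform:", round(mean_estimate(uniform)))
print("mixed:  ", round(mean_estimate(mixed)))
```

Both scenarios have 100 real bugs and the same average detection rate, but in the mixed case the reviewers keep overlapping on the easy bugs, which pulls the estimate well below 100.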

justinpombrio 912 days ago

That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= (caught * caught) / tagged
    (where caught=100 for both catches in the example)

In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64

Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.
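To make the second formula concrete, here is a toy sketch of the guess-and-check approach. It uses a made-up ID space of 2^20 instead of 2^64 and a randomly generated set of "existing" IDs, so nothing here touches YouTube; all names are illustrative.

```python
import random


def estimate_total(is_valid, id_space_size, tries, rng):
    """Scale the observed hit rate up to the size of the whole ID space."""
    valid = sum(1 for _ in range(tries) if is_valid(rng.randrange(id_space_size)))
    return valid / tries * id_space_size


rng = random.Random(42)
ID_SPACE = 2 ** 20        # toy stand-in for the 2^64 space
TRUE_TOTAL = 50_000       # number of IDs that "exist" in this toy world
existing_ids = set(rng.sample(range(ID_SPACE), TRUE_TOTAL))

estimate = estimate_total(existing_ids.__contains__, ID_SPACE, 100_000, rng)
print(f"true={TRUE_TOTAL} estimated={estimate:.0f}")
```

The measured quantity (valid hits) sits in the numerator here, whereas in mark-recapture the measured quantity (tagged recaptures) sits in the denominator.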

ad404b8a372f2b9 912 days ago

Did you understand where the 2^64 came from in their explanation btw? I would have thought it would be (64^10)*16 according to their description of the string.

Edit: Oh because 64^10 * 16 = (2^6)^10 * (2^4)

dajonker 912 days ago

The YouTube identifiers are actually 64-bit integers encoded using URL-safe base64 encoding. Hence the limited number of possible characters for the 11th position.
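A quick sketch of why that works out, using Python's standard `base64` module. The helper names are invented, and this only shows the encoding arithmetic for an arbitrary 64-bit integer, not how YouTube actually generates IDs.

```python
import base64


def encode_id(n: int) -> str:
    """Encode a 64-bit integer as 11 URL-safe base64 characters (padding stripped)."""
    raw = n.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_id(s: str) -> int:
    """Inverse of encode_id: restore the single padding character, then decode."""
    raw = base64.urlsafe_b64decode(s + "=")
    return int.from_bytes(raw, "big")


# 8 bytes = 64 bits = ten full 6-bit characters plus 4 leftover bits,
# so the 11th character can only take 2^4 = 16 distinct values.
last_chars = {encode_id(n)[-1] for n in range(256)}

print(len(encode_id(2 ** 64 - 1)))   # 11
print(len(last_chars))               # 16
print(64 ** 10 * 16 == 2 ** 64)      # True
```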

zellyn 912 days ago

Do you get the same 100 dumb fish?

p1mrx 912 days ago

Why are they dumb? Free tag.

labster 912 days ago

Imagine being the only fish without a tag. Everyone at school will know how lame you are.

DeathArrow 912 days ago

It would be illegal not to have a tag. If the fish has nothing to hide, it shouldn't worry about being tagged.

And, also, the fish gets tagged for its own good.

DeathArrow 912 days ago

>Everyone at school will know how lame you are.

They'll even call you tinfoil fish.

eek2121 912 days ago

This comment. Please see here.

egeozcan 912 days ago

Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.

nkurz 912 days ago

You make a very weak argument, and are simply assuming the conclusion.

What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?

To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?

If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.

egeozcan 912 days ago

Fair points. But, mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (so I mean best in these regards, to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (and many other things, like listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall, it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.

lanstin 911 days ago

In my experience conservation biologists are really good at finding animals in the wild. Much better than a typical SWE or typical business person.

panarky 912 days ago

Wouldn't a previously caught fish be less likely to fall for the same trick a second time?

soonwitdafishis 912 days ago

only if you're within a 100 mile radius of me the ultimate dumb fish

dclowd9901 912 days ago

I made the same connection but it’s still the first time I’ve seen it used for reverse looking up IDs.

midasuni 912 days ago

It’s not even new in the YouTube space; as they acknowledge, it was already done in 2011:

https://dl.acm.org/doi/10.1145/2068816.2068851

krackers 912 days ago

Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem http://www.stat.yale.edu/~yw562/reprints/species-si.pdf
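For the simplest version of the "N samples, Y repeats" question, a birthday-problem style estimate is one option. This sketch is not taken from the linked references: it assumes uniform sampling with replacement and counts repeated pairs, using the fact that the expected number of such pairs is roughly N(N-1)/(2 * total). Names are made up.

```python
import random
from collections import Counter


def estimate_population(samples):
    """Estimate population size from collisions among uniform samples with replacement."""
    n = len(samples)
    counts = Counter(samples)
    # An item seen c times contributes c*(c-1)/2 repeated pairs.
    repeated_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    if repeated_pairs == 0:
        raise ValueError("no repeats observed; the estimate is undefined")
    return n * (n - 1) / (2 * repeated_pairs)


rng = random.Random(7)
TRUE_SIZE = 10_000
samples = [rng.randrange(TRUE_SIZE) for _ in range(5_000)]
estimate = estimate_population(samples)

print(f"true={TRUE_SIZE} estimated={estimate:.0f}")
```

Real species-abundance data is far from uniform, which is what the more careful estimators in the links above are designed to handle.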

neurostimulant 911 days ago

> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up stats though? It's like a lake monster randomly swapping an untagged fish with tagged fish as you catch them.

fergbrain 912 days ago

Isn’t this just a variation of the Monte Carlo method?

layer8 912 days ago

That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.