by 0x1ceb00da 912 days ago
This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, and count the tagged fish in the second batch.)

That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen Estimator.
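
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the Lincoln-Petersen arithmetic from the reviewer example. The function and variable names are made up for illustration, and the estimate only holds if every bug is equally likely to be found by each reviewer.

```python
def lincoln_petersen(first_sample: int, second_sample: int, overlap: int) -> float:
    """Estimate total population size from two samples and their overlap."""
    if overlap <= 0:
        raise ValueError("need at least one item found in both samples")
    return first_sample * second_sample / overlap


# Numbers from the reviewer example: A finds 4 bugs, B finds 5, 2 are shared.
found_a, found_b, found_both = 4, 5, 2

estimated_total = lincoln_petersen(found_a, found_b, found_both)
distinct_found = found_a + found_b - found_both  # bugs seen by at least one reviewer

print(estimated_total)                   # 10.0
print(distinct_found)                    # 7
print(estimated_total - distinct_found)  # 3.0 estimated still uncaught
```

With zero overlap the estimate is undefined, which is why the sketch raises instead of returning infinity.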
A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)

https://en.m.wikipedia.org/wiki/Bebugging
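
A rough sketch of the fault-seeding arithmetic in Python, assuming seeded bugs and real bugs are equally easy to find (which may well not hold in practice). The function name and the counts are hypothetical, not taken from any real project.

```python
def estimate_real_bugs(seeded_total: int, seeded_found: int, real_found: int) -> float:
    """Estimate the total number of real bugs from how many seeded bugs testing caught."""
    if seeded_found <= 0:
        raise ValueError("testing must catch at least one seeded bug")
    detection_rate = seeded_found / seeded_total  # fraction of seeded bugs caught
    return real_found / detection_rate


# Hypothetical run: 20 bugs seeded, 15 of them caught, plus 30 real bugs found.
estimated_real = estimate_real_bugs(seeded_total=20, seeded_found=15, real_found=30)
estimated_remaining = estimated_real - 30

print(estimated_real)       # 40.0
print(estimated_remaining)  # 10.0
```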

But this assumes that all bugs are equally likely to be found, which I highly doubt, no?
Yes, it's obviously not a perfect estimate, but it can be directionally helpful.

You could bucket bugs into categories by severity or type and that might improve the estimate, as well.

Oh this is a really interesting concept.

I guess it underestimates the number of hard-to-find bugs, though, since it assumes every bug has the same likelihood of being found.

That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= (caught * caught) / tagged
    (where caught=100 in the example, for both the first and the second catch)
In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.
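
To make the hit-rate formula concrete, here is a small Python sketch. It is not the authors' code: the function name is made up, and the toy ID space (2^20 instead of 2^64) and the planted population of 50,000 are assumptions chosen so the simulation runs instantly.

```python
import random


def estimate_total(valid: int, tried: int, id_space: int) -> float:
    """Scale the observed hit rate up to the size of the whole ID space."""
    if tried <= 0:
        raise ValueError("need at least one attempt")
    return valid / tried * id_space


# Toy simulation: plant a known number of "videos" in a small ID space,
# then guess IDs uniformly at random and count how many guesses hit.
rng = random.Random(0)
toy_space = 2 ** 20
true_total = 50_000
existing_ids = set(rng.sample(range(toy_space), true_total))

tried = 20_000
valid = sum(rng.randrange(toy_space) in existing_ids for _ in range(tried))

toy_estimate = estimate_total(valid, tried, toy_space)
print(f"true total: {true_total}, estimated: {toy_estimate:.0f}")
```

The estimate gets noisier as the hit rate drops, so a sparse space like 2^64 needs a very large number of guesses to collect enough hits.
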
Did you understand where the 2^64 came from in their explanation btw? I would have thought it would be (64^10)*16 according to their description of the string.

Edit: Oh because 64^10 * 16 = (2^6)^10 * (2^4)

The YouTube identifiers are actually 64-bit integers encoded using URL-safe base64. Ten characters carry 60 bits, so the 11th only has to carry the remaining 4, hence the limited number of possible characters for the 11th position.
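
If you want to convince yourself, here is a quick Python sketch using the standard `base64` module. It takes the "64-bit integer in URL-safe base64" description at face value; the helper names and the big-endian byte order are my own assumptions, not anything YouTube documents.

```python
import base64


def encode_id(value: int) -> str:
    """Encode a 64-bit integer as an 11-character URL-safe base64 string."""
    raw = value.to_bytes(8, "big")  # byte order is an assumption
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_id(text: str) -> int:
    """Decode an 11-character URL-safe base64 string back to an integer."""
    raw = base64.urlsafe_b64decode(text + "=")  # restore the stripped padding
    return int.from_bytes(raw, "big")


# Varying only the lowest 4 bits shows every value the 11th character can take.
last_chars = {encode_id(n)[-1] for n in range(16)}

print(encode_id(0))              # AAAAAAAAAAA
print(len(last_chars))           # 16
print(64 ** 10 * 16 == 2 ** 64)  # True
```
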
Do you get the same 100 dumb fish?
Why are they dumb? Free tag.
Imagine being the only fish without a tag. Everyone at school will know how lame you are.
It would be illegal not to have a tag. If the fish has nothing to hide, it shouldn't worry about being tagged.

And, also, the fish gets tagged for its own good.

>Everyone at school will know how lame you are.

They'll even call you tinfoil fish.

This comment. Please see here.
Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.
You make a very weak argument, and are simply assuming the conclusion.

What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?

To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?

If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.

Fair points. But mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (that's what I mean by "best", to be 100% clear). Sure, different capture methods might skew results, but the technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (along with many other factors, such as those listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.
In my experience conservation biologists are really good at finding animals in the wild. Much better than a typical SWE or typical business person.
Wouldn't a previously caught fish be less likely to fall for the same trick a second time?
only if you're within a 100 mile radius of me the ultimate dumb fish
I made the same connection but it’s still the first time I’ve seen it used for reverse looking up IDs.
It’s not even new in the YouTube space; as they acknowledge, it was already done in 2011:

https://dl.acm.org/doi/10.1145/2068816.2068851

Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem

http://www.stat.yale.edu/~yw562/reprints/species-si.pdf

> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up the stats, though? It's like a lake monster randomly swapping an untagged fish for a tagged one as you catch them.

Isn’t this just a variation of the Monte Carlo method?
That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.