This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag and release them, wait a week, catch 100 fish again, and count how many fish in the second batch are tagged.)
That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then the Lincoln-Petersen Estimator gives 4 * 5 / 2 = 10 total bugs in the code (7 caught, 3 uncaught).
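For anyone who wants to play with the numbers, here is a minimal sketch of that estimate in Python. The function and argument names are my own, and it uses the plain Lincoln-Petersen ratio with no small-sample correction.

```python
def lincoln_petersen(first_catch: int, second_catch: int, recaptured: int) -> float:
    """Estimate the total population from two samples and their overlap."""
    if recaptured <= 0:
        # With no overlap the ratio is undefined, so refuse to guess.
        raise ValueError("need at least one recapture to form an estimate")
    return first_catch * second_catch / recaptured


# Reviewer A finds 4 bugs, reviewer B finds 5, and 2 are found by both.
total = lincoln_petersen(4, 5, 2)
found = 4 + 5 - 2  # distinct bugs seen by at least one reviewer
print(total, found, total - found)  # 10.0 7 3.0
```

With so few recaptures the estimate is very noisy, so treat it as a rough order of magnitude rather than a count.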
A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)
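The arithmetic behind fault seeding is the same ratio trick. A rough sketch, assuming seeded bugs are about as easy to find as real ones (a big assumption); the names and the example numbers below are made up for illustration.

```python
def estimate_real_bugs(seeded_total: int, seeded_found: int, real_found: int) -> float:
    """Estimate the total number of real bugs from the seeded-bug detection rate."""
    if seeded_found <= 0:
        raise ValueError("no seeded bugs were found, so the detection rate is unknown")
    detection_rate = seeded_found / seeded_total
    return real_found / detection_rate


# Hypothetical run: 20 bugs seeded, testing finds 15 of them plus 30 real bugs.
total_real = estimate_real_bugs(seeded_total=20, seeded_found=15, real_found=30)
print(total_real, total_real - 30)  # 40.0 10.0
```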
That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.
In the "100 fish" example, the formula for approximating the total number of fish is:
total ~= (marked * caught) / tagged
(where marked=100 and caught=100 in the example)
In their YouTube sampling method, the formula for approximating the total number of videos is:
total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish in the second catch that already carry a tag), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of URLs that resolved to videos), which is in the numerator.
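To make the YouTube formula concrete, here is a small sketch that scales a hit rate up to the full ID space. The function name and the sample numbers are invented for illustration; they are not the authors' data.

```python
ID_SPACE = 2**64  # number of possible video IDs, per the formula above


def estimate_total_videos(valid: int, tried: int, id_space: int = ID_SPACE) -> float:
    """Scale the observed hit rate up to the whole ID space."""
    if tried <= 0:
        raise ValueError("must try at least one ID")
    return valid / tried * id_space


# Made-up example: 3 hits out of 2**40 random guesses.
print(estimate_total_videos(valid=3, tried=2**40))  # 50331648.0
```

Here the measured count sits in the numerator: doubling the number of hits doubles the estimate, whereas doubling the number of recaptured fish halves it.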
Did you understand where the 2^64 came from in their explanation btw?
I would have thought it would be (64^10)*16 according to their description of the string.
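The YouTube identifiers are actually 64-bit integers encoded using URL-safe base64. The first ten characters carry 60 bits and the 11th carries the remaining 4, hence only 16 possible characters in that position. So (64^10)*16 = 2^60 * 2^4 = 2^64; the two figures are the same number.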
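You can check this with Python's standard `base64` module. This is only a sketch of the encoding described above: `encode_id` is my own helper, and the big-endian byte order is an assumption, not something YouTube documents.

```python
import base64

# The two ways of counting the ID space agree.
assert 64**10 * 16 == 2**64


def encode_id(n: int) -> str:
    """Encode a 64-bit integer as 11 URL-safe base64 characters."""
    raw = n.to_bytes(8, "big")  # byte order is an assumption
    # 8 bytes encode to 12 characters including one '=' of padding; drop it.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


print(len(encode_id(2**64 - 1)))  # 11

# Vary only the low 4 bits: these are all the values the last character can take.
last_chars = sorted({encode_id(n)[-1] for n in range(16)})
print(len(last_chars))  # 16
```

The last character only has 16 options because it encodes 4 real bits plus 2 zero padding bits.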
Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.
You make a very weak argument, and are simply assuming the conclusion.
What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?
To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?
If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.
Fair points. But mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (that's what I mean by "best", to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (along with many other things, like those listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall, it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.
> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.
Won't this mess up the stats though? It's like a lake monster randomly swapping untagged fish for tagged ones as you catch them.