| HN Mirror

Yes, rhymer has hit on the primary tradeoff behind Algorithm A. While Algorithm A is fast, easy, and makes it possible to sample any dataset you can access with SQL, including massive distributed datasets, its draws from the population are not independent and identically distributed (because with each draw you make the population one item smaller), nor does it let you compute the inclusion probabilties that would let you use the most common reweighting methods, such as a Horvitz–Thompson estimator, to produce unbiased estimates from potentially biased samples.

In practice, however, when your samples are small, your populations are large, and your populations' weights are not concentrated in a small number of members, you can use Algorithm A and just pretend that you have a sample of i.i.d. draws. In these circumstances, the potential for bias is generally not worth worrying about.

But when you cannot play so fast and loose, you can produce unbiased estimates from an Algorithm A sample by using an ordered estimator, such as Des Raj's ordered estimator [1].

You could alternatively use a different sampling method, one that does produce inclusion probabilities. But these tend to be implemented only in more sophisticated statistical systems (e.g., R's "survey" package [2]) and thus not useful unless you can fit your dataset into those systems.

For very large datasets, then, I end up using Algorithm A to take samples and (when it matters) Des Raj's ordered estimator to make estimates.

(I was planning a follow up post to my blog on this subject.)

[1] The original papers are hard to come by online, so I suggest starting with an introduction like https://home.iitk.ac.in/~shalab/sampling/chapter7-sampling-v..., starting at page 11.

[2] https://cran.r-project.org/web/packages/survey/survey.pdf