by teraflop 3345 days ago
Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful.

(You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
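
To put rough numbers on that, here's a minimal sketch. The dataset, the sizes, and the `compare_sampling` name are all made up for illustration: 1M rows, a slice covering 0.1% of them, and a 1% uniform sample.

```python
import random

def compare_sampling(n_rows=1_000_000, slice_size=1_000, sample_size=10_000, seed=0):
    """Count how many rows of a narrow slice survive a uniform random sample."""
    rng = random.Random(seed)
    # Hypothetical dataset: True marks a row in the narrow slice (0.1% of rows here).
    rows = [i < slice_size for i in range(n_rows)]

    # Option 1: sample uniformly, then filter down to the slice.
    sample = rng.sample(rows, sample_size)
    slice_rows_in_sample = sum(sample)

    # Option 2: filter first, then sample. Only possible if the slice is known up front.
    slice_only = [r for r in rows if r]
    prefiltered_sample = rng.sample(slice_only, min(sample_size, len(slice_only)))

    return slice_rows_in_sample, len(prefiltered_sample)

sample_then_filter, filter_then_sample = compare_sampling()
print(sample_then_filter, filter_then_sample)
```

With these assumed sizes the expected number of slice rows in the uniform sample is 10,000 × 0.001 = 10, which is too few to say much about the slice, while filtering first keeps all 1,000 of them. The 10,000-row sample is still fine for questions about the whole dataset.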