| HN Mirror

	by teraflop 3345 days ago
	Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful. (You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
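
	To put rough numbers on the narrow-slice problem, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration, not something from a real dataset: a table of 1,000,000 rows, a 1% uniform random sample, and a "narrow slice" that matches 1 row in every 10,000. The helper names (`sample_slice_counts`, `worst_case_standard_error`) are hypothetical too.

```python
import math
import random


def sample_slice_counts(n_rows=1_000_000, sample_rate=0.01,
                        slice_every=10_000, seed=0):
    """Draw a uniform random sample of row ids and count how many
    of the sampled rows fall inside a narrow slice.

    Assumption for illustration: a row is "in the slice" when its id
    is a multiple of slice_every, so the slice holds
    n_rows / slice_every rows in the full dataset.
    """
    rng = random.Random(seed)
    sample_size = int(n_rows * sample_rate)
    sampled_ids = rng.sample(range(n_rows), sample_size)
    slice_size = sum(1 for row_id in sampled_ids if row_id % slice_every == 0)
    return sample_size, slice_size


def worst_case_standard_error(n):
    """Standard error of an estimated proportion at p = 0.5 (the worst
    case) from n independent observations: sqrt(0.25 / n)."""
    return math.inf if n == 0 else 0.5 / math.sqrt(n)


sample_size, slice_size = sample_slice_counts()

# Whole-dataset question: 10,000 sampled rows, error around half a percent.
print("sample rows:", sample_size,
      "std error:", worst_case_standard_error(sample_size))

# Narrow-slice question: only ~1 sampled row is expected
# (1,000,000 * 0.01 / 10,000), so the estimate is close to useless.
print("slice rows in sample:", slice_size,
      "std error:", worst_case_standard_error(slice_size))
```

	The full dataset has 100 rows in that slice, which might be enough to work with, but the sample is expected to keep only about one of them. That is the gap pre-filtering or pre-aggregating would close, at the cost of having to know the slice in advance.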