Hacker News
by sfifs 1228 days ago
For people who can write code, the simplest exercise to convince yourself of foundational statistics is simulations.

Create a simulated population with some distribution of a metric and run multiple sampling simulations. You'll be surprised. You can even put in sampling biases to test the impact.
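Something like this is what I have in mind, as a rough sketch using only the Python standard library. The log-normal population, the sample size of 200, and the "bigger values are more likely to be observed" bias are all made-up assumptions for illustration, not anything from a real dataset.

```python
import random
import statistics

def make_population(n=10_000, seed=0):
    rng = random.Random(seed)
    # Skewed metric (log-normal); the shape is an arbitrary illustrative choice.
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]

def sample_means(population, sample_size, runs, rng, weights=None):
    """Return the mean of each of `runs` samples; `weights` biases selection."""
    means = []
    for _ in range(runs):
        if weights is None:
            # Simple random sample without replacement.
            sample = rng.sample(population, sample_size)
        else:
            # Biased sample (with replacement): selection chance follows `weights`.
            sample = rng.choices(population, weights=weights, k=sample_size)
        means.append(statistics.fmean(sample))
    return means

population = make_population()
rng = random.Random(1)
true_mean = statistics.fmean(population)

unbiased = sample_means(population, 200, 300, rng)
# Hypothetical bias: units with larger values are more likely to be observed.
biased = sample_means(population, 200, 300, rng, weights=population)

print(f"population mean      : {true_mean:.3f}")
print(f"unbiased sample means: {statistics.fmean(unbiased):.3f} "
      f"(spread {statistics.stdev(unbiased):.3f})")
print(f"biased sample means  : {statistics.fmean(biased):.3f} "
      f"(spread {statistics.stdev(biased):.3f})")
```

The unbiased sample means should scatter around the population mean, while the biased ones sit well above it no matter how many samples you draw. Swap in whatever bias mechanism you suspect in your own data collection.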

Monte Carlo simulations are a surprisingly powerful tool. I once discovered that FAANG data scientists were misunderstanding statistical significance in a reporting product they made by half an order of magnitude, because they didn't understand the impact of observational methodology and sampling bias in their product. In my company, we set our own thresholds much larger than what the product recommended.

1 comments

Right, but this just reinforces my thought here. In order to simulate sampling, I have to know the data well enough to simulate it. Which, for many things I'd care about, if I knew the underlying distribution that well, I probably don't need to sample. :(
I meant doing it as a theoretical planning exercise. You can throw in any number of weird distributions you might guess, and you'll be surprised at how quickly sampling will fairly reliably pick up patterns. This helps you plan your sampling around uncertainty.
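As a sketch of that planning exercise (standard library only): guess a few shapes, then see how much the sample mean wobbles at different sample sizes. The three distributions and the sample sizes below are placeholders I picked for illustration; substitute your own guesses.

```python
import random
import statistics

rng = random.Random(42)

# Guessed "what if" shapes; all are made-up placeholders, not fitted to any data.
guesses = {
    "bimodal": lambda: rng.gauss(0, 1) if rng.random() < 0.7 else rng.gauss(8, 1),
    "exponential": lambda: rng.expovariate(1.0),
    "heavy tail": lambda: rng.paretovariate(2.5),
}

def spread_of_mean(draw, sample_size, runs=400):
    """Standard deviation of the sample mean across repeated simulated samples."""
    means = [
        statistics.fmean(draw() for _ in range(sample_size))
        for _ in range(runs)
    ]
    return statistics.stdev(means)

results = {}
for name, draw in guesses.items():
    results[name] = {n: spread_of_mean(draw, n) for n in (25, 100, 400)}
    row = "  ".join(f"n={n}: {s:.3f}" for n, s in results[name].items())
    print(f"{name:12s} {row}")
```

If the spread at a given sample size is acceptable under every shape you consider plausible, that sample size is probably a reasonable starting point even though you don't know the true distribution.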

Of course, if your underlying distribution is likely to be Gaussian, which is true for many phenomena, you don't need to bother except as a pedagogical exercise.

If you know a bit of programming, that's actually sufficient to explore these ideas and verify them for yourself.

Allen Downey has a ton of open source books that use this philosophy [0] and Peter Norvig has used Python notebooks in a similar manner (look at the ones in the Probability section) [1].

[0] https://greenteapress.com/wp/
[1] https://github.com/norvig/pytudes#pytudes-index-of-jupyter-i...

> Which, for many things I'd care about, if I knew the underlying distribution that well, I probably don't need to sample

You don't have to sample directly. The entire field of Bayesian variational learning exists to deal with that very problem. Look up Markov chain Monte Carlo, the Metropolis algorithm, conjugate priors, and the reparametrization trick.
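To give a flavour of the Metropolis algorithm, here is a minimal random-walk sketch in plain Python. It only needs a density known up to a constant. The target below is a stand-in (an unnormalized Normal with mean 3 and standard deviation 2) chosen so the answer is easy to check; the step size, chain length, and burn-in are likewise arbitrary assumptions.

```python
import math
import random
import statistics

def log_target(x):
    # Unnormalized log-density; a Normal(mean=3, sd=2) stand-in for illustration.
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def metropolis(log_p, start, steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x = start
    log_px = log_p(x)
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_pp = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_pp - log_px)):
            x, log_px = proposal, log_pp
        samples.append(x)  # a rejected proposal repeats the current state
    return samples

chain = metropolis(log_target, start=0.0, steps=50_000, step_size=2.5,
                   rng=random.Random(7))
kept = chain[5_000:]  # discard an (arbitrary) burn-in period
print(f"estimated mean: {statistics.fmean(kept):.2f}")
print(f"estimated sd  : {statistics.stdev(kept):.2f}")
```

The estimates should land close to 3 and 2. Because only the ratio of densities is used, the normalizing constant never has to be computed, which is what makes this usable when you can write down a model but can't sample from it directly.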

Thanks for the pointers, will be looking into these!