by fritzo 94 days ago
Hot take: bell curves are everywhere exactly because the math is simple.

The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.

The central limit theorem generalizes beyond simple math to hard math: Lévy alpha-stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions for extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them because the math is tough.

6 comments

It also took me a little while to realize “least squares” and MMSE approaches were not necessarily the “correct” way to do things but just “one thing we actually know how to do” because everything else is much harder.

We can use Calculus to do so much but also so little…

That isn't the case; mathematicians will do pages of calculations (particularly and especially the statisticians) if they can prove one approach is technically superior to another. These people, as a class, are the crazies who invented matrix multiplication. Something like MMSE is used because it has provably optimal properties for estimating a posterior distribution.

It is certainly possible that there are complex approaches that the statisticians have not discovered or don't teach because they are too complicated, but they had a big fight about which techniques were provably superior early in the discipline's history, and the choices of what got standardised on weren't made because of ease of calculation. It has actually been quite interesting how little interest the statisticians seem to be taking in things like the machine learning revolution, since the mathematics all seems pretty amenable to last century's techniques despite orders of magnitude differences in the data being handled.

> optimum properties for estimating a posterior distribution

Circular reasoning: that's true only if the posterior is normal, or if your "optimal" is defined by second moments. In infinite variance cases, the best estimator can be the median or an alpha moment for alpha < 2, but yikes the math is much more difficult.
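
A rough sketch of the infinite-variance point, using only the Python standard library. It assumes Cauchy data (alpha = 1) centred at 0 as the test case; the function names, sample sizes, and seed are made up for illustration and do not come from the thread. It compares how far the sample mean and the sample median typically land from the true centre.

```python
import math
import random
import statistics

def cauchy_sample(rng, n, loc=0.0):
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy for uniform U.
    return [loc + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def estimator_spread(n=1001, trials=200, seed=0):
    """Median absolute error of the sample mean and of the sample median."""
    rng = random.Random(seed)
    mean_errs, median_errs = [], []
    for _ in range(trials):
        xs = cauchy_sample(rng, n)          # true centre is 0
        mean_errs.append(abs(statistics.fmean(xs)))
        median_errs.append(abs(statistics.median(xs)))
    return statistics.median(mean_errs), statistics.median(median_errs)

mean_err, median_err = estimator_spread()
print(f"typical |mean - centre|   = {mean_err:.3f}")
print(f"typical |median - centre| = {median_err:.3f}")
```

The mean of n Cauchy draws is itself standard Cauchy, so its typical error should not shrink with n, while the median's error should shrink roughly like 1/sqrt(n).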

-- A mathematician who has indeed fallen into the beauty trap

> Circular reasoning: that's true only if the posterior is normal, or if your "optimal" is defined by second moments.

That doesn't sound right, it is an error minimising technique. Are we not talking about minimising mean square errors? Why would the posterior need to be normal? And why would optimal need to be defined by 2nd moments?
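
For reference, the textbook decomposition behind this exchange can be written out as below. The symbols (theta for the unknown, y for the data, a for the point estimate) are introduced here for illustration, and the first identity assumes the posterior has a finite second moment.

```latex
\mathbb{E}\big[(\theta - a)^2 \mid y\big]
  = \operatorname{Var}(\theta \mid y) + \big(\mathbb{E}[\theta \mid y] - a\big)^2
\quad\Longrightarrow\quad
\arg\min_a \, \mathbb{E}\big[(\theta - a)^2 \mid y\big] = \mathbb{E}[\theta \mid y]

\arg\min_a \, \mathbb{E}\big[\,|\theta - a| \mid y\big] = \operatorname{median}(\theta \mid y)
```

So the posterior mean minimises squared error without any normality assumption, but only when the second moment exists and squared error is the loss you care about; under absolute error the minimiser is the posterior median instead.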

I've often described this as a bias towards easily taught ("teachable") material over more realistic but difficult to teach material. Sometimes teachers teach certain subjects because they fit the classroom well as a medium. Some subjects are just hard to teach in hour-long lectures using whiteboards and slides. They might be better suited to other media, especially self study, but that does not mean that teachers should ignore them.
Most things aren't infinite or extreme, though. Almost by definition, most phenomena aren't extreme phenomena.
No, but when you get into the nitty gritty of most things sometimes being influenced by extremely rare things, and also that the convergence rate of the central limit theorem is not universal at all, then much of the utility (and apparent universality) of the CLT starts to evaporate.

In practice when modeling you are almost always better not assuming normality, and you want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods or Bayesian methods that don't assume Gaussian priors are almost always better models. But this of course brings into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).

Heavy tails are everywhere. Normal distributions have absurdly light tails. Levy alpha stable distributions have power law tails. Power law tails are everywhere.
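
To put numbers on "absurdly light", here is a small sketch comparing exact tail probabilities of a standard normal with a Pareto power law. The shape alpha = 1.5 and minimum 1 are assumptions picked for illustration, and x is in raw units rather than standard deviations, since that Pareto has no finite variance.

```python
import math

def normal_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pareto_tail(x, alpha, x_min=1.0):
    """P(X > x) for a Pareto variable with shape alpha and minimum x_min."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

for x in (2, 5, 10, 20):
    print(f"x={x:>2}  normal={normal_tail(x):.3e}  "
          f"pareto(alpha=1.5)={pareto_tail(x, 1.5):.3e}")
```

The normal tail falls off like exp(-x^2/2), while the power-law tail only falls off like x^(-alpha).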

Some things with heavy tails:

  token occurrences
  comment thread upvotes
  startup IPOs
  social follower counts
  network latency
  github stars
  git diffs
  power station size
  weather events
The CLT is everywhere because convolution/adding independentish random variables is a super common thing to do.
Right. And the CLT is not actually limited to normal distributions. Both of the distribution families I mentioned are central limit theorems. The CLT we first see in school regards means of finite variance distributions, where the finite variance assumption is made because it makes the math easier.

https://en.wikipedia.org/wiki/Central_limit_theorem#The_gene...

any good resources to understand more about them?
That’s exactly the right take and the article proves it:

Statisticians love averages so everywhere that could be sampled as a normal distribution will be presented as one

The median is actually more descriptive and power laws are just as pervasive, if not more

combining repeated samples of any distribution* (any population density function including power law distributions) will converge to the normal distribution, that's why it appears everywhere.

* excluding bizarre degenerates like constants or impulse functions

No, that's not correct. Sums of power law distributions can converge to power-law-tailed distributions, not normal distributions.
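
A quick simulation sketch of that claim, standard library only. It assumes Pareto draws with tail index alpha = 1.5 (finite mean, infinite variance); the names, sample sizes, seed, and threshold are hypothetical choices. Sums are centred and scaled by n^(1/alpha) rather than sqrt(n), and the code counts how often the scaled sum exceeds 10.

```python
import random

ALPHA = 1.5                      # assumed tail index; variance is infinite for alpha < 2
MEAN = ALPHA / (ALPHA - 1.0)     # mean of a Pareto(alpha) variable with minimum 1

def scaled_sum(rng, n):
    """Centred sum of n Pareto draws, scaled by n**(1/alpha) instead of sqrt(n)."""
    total = sum(rng.paretovariate(ALPHA) for _ in range(n))
    return (total - n * MEAN) / n ** (1.0 / ALPHA)

def tail_fraction(threshold=10.0, n=500, trials=2000, seed=0):
    """Fraction of simulated scaled sums that exceed the threshold."""
    rng = random.Random(seed)
    hits = sum(scaled_sum(rng, n) > threshold for _ in range(trials))
    return hits / trials

frac = tail_fraction()
print(f"empirical P(Z > 10) = {frac:.4f}")
print(f"power-law guess 10**-1.5 = {10 ** -1.5:.4f}")
```

If the scaled sums were close to Gaussian, exceedances this far out would be rare; a fraction of a few percent, in the neighbourhood of 10^(-1.5), is what a power-law tail with the same index would suggest.
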
No use arguing with them; they don’t have enough mathematical understanding to understand what they’re saying.