I am sorry but I have to call bullshit on this.

To give just a taste of the nice properties of KL: if you are using a 1-layer NN with the sigmoid function as the transform, using square loss gives you an explosion of local minima. OTOH using KL in its place would have given you none, because the loss is then convex in the weights. Numerical accuracy is pretty much a non-issue; people have known how to handle KL numerically for the last 40 or so years.
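For what it's worth, here is a minimal sketch of the usual numerical trick, assuming a Bernoulli target y in [0, 1] and a sigmoid output. The function name `sigmoid_kl_from_logit` is made up for illustration. The idea is to work from the logit z directly and never form log(sigmoid(z)) or log(1 - sigmoid(z)).

```python
import math


def sigmoid_kl_from_logit(z: float, y: float) -> float:
    """KL(Bernoulli(y) || Bernoulli(sigmoid(z))), computed from the logit z.

    Hypothetical helper for illustration; not taken from any library.
    """
    # Cross-entropy in stable form:
    #   -y*log(s) - (1-y)*log(1-s) = max(z, 0) - y*z + log(1 + exp(-|z|))
    # exp() only ever sees a non-positive argument, so it cannot overflow.
    cross_entropy = max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

    # Entropy of the target, using the convention 0 * log(0) = 0.
    entropy = 0.0
    for p in (y, 1.0 - y):
        if p > 0.0:
            entropy -= p * math.log(p)

    # KL = cross-entropy minus the entropy of the target.
    return cross_entropy - entropy


if __name__ == "__main__":
    # A confident wrong prediction: the naive formula hits log(0) here.
    print(sigmoid_kl_from_logit(1000.0, 0.0))
    # A confident correct prediction.
    print(sigmoid_kl_from_logit(1000.0, 1.0))
```

With hard 0/1 labels the entropy term is zero, so this is just the familiar logistic loss.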

BTW using KL on equal-variance Gaussians gives you square loss, apparently the loss you prefer.
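To spell that out, assuming the claim is about two univariate Gaussians that share the same variance (the symbols \mu_1, \mu_2, \sigma are my notation, not from the thread): start from the general Gaussian KL and set both variances to \sigma^2.

```latex
\begin{align*}
\mathrm{KL}\bigl(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\bigr)
  &= \log\frac{\sigma_2}{\sigma_1}
     + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}
     - \frac{1}{2} \\
\text{with } \sigma_1=\sigma_2=\sigma:\quad
\mathrm{KL}\bigl(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\bigr)
  &= \frac{(\mu_1-\mu_2)^2}{2\sigma^2}
\end{align*}
```

So up to the constant factor 1/(2\sigma^2), minimizing KL in that setting is the same as minimizing squared error between the means.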