
A lot of machine-learning papers are eight pages. Speech conference papers (heavy users of neural nets) are often only four. Some details aren't part of the main message, so they don't make it in. Often code is available, and initialization and other tweaks can be found in there (worth reading even if you aren't going to use their code).

That said, there are also whole papers, even collected volumes, on initialization and other practical details.

Textbooks aren't always up to date with the latest practical knowledge, as deep-learning practice is moving quickly. Or they simply don't want to clutter their high-level maths descriptions with code-level implementation details. Teaching is all about tradeoffs. I'm sure several books do mention the scale of the initial weights for simple feed-forward networks though, as it's not an implementation-level detail, and it's probably been well known since the 1980s.
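
To make the weight-scale point concrete, here is a rough sketch of my own, not taken from any particular paper or textbook. It assumes one common convention: zero-mean Gaussian weights with standard deviation 1/sqrt(fan_in). The names `init_weights` and `preactivation_variance` are made up for illustration, and the layer sizes are arbitrary.

```python
import math
import random


def init_weights(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in matrix of Gaussian weights with std 1/sqrt(fan_in)."""
    std = 1.0 / math.sqrt(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def preactivation_variance(weights, inputs):
    """Variance, across output units, of the linear layer outputs w . x."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    mean = sum(outputs) / len(outputs)
    return sum((o - mean) ** 2 for o in outputs) / len(outputs)


rng = random.Random(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 256, 512  # arbitrary sizes, chosen only for the demo

# Unit-variance inputs, standing in for normalized features.
inputs = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]

# Scaled weights keep the pre-activations at roughly unit variance.
scaled = init_weights(fan_in, fan_out, rng)

# Unit-variance weights (no scaling) blow the variance up by about fan_in.
unscaled = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(fan_out)]

var_scaled = preactivation_variance(scaled, inputs)
var_unscaled = preactivation_variance(unscaled, inputs)
print(f"scaled: {var_scaled:.2f}, unscaled: {var_unscaled:.2f}")
```

With the scaling, the variance of the pre-activations stays near that of the inputs. Without it, the variance grows roughly in proportion to fan_in, which is the sort of thing that saturates sigmoid or tanh units right from the start of training.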