by hyperbovine 4198 days ago
Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?

As in many things, it's a combination of both. For example:

- We wanted no more than one recurrent layer, since recurrent layers are a big bottleneck to parallelization.

- The recurrent layer should go "higher" in the network, as it's more effective at propagating long-range context when it operates on the network's learned feature representation than when it operates on raw input values.
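
To make the two points above concrete, here is a rough pure-Python sketch, not the actual model. It assumes a layout of three feed-forward layers, then one recurrent layer, then one more feed-forward layer. The thread only confirms "5 hidden layers, the first 3 non-recurrent, at most one recurrent", so the final layer's type, the plain ReLU, the forward-only recurrence, the layer widths, and every function name here are illustrative assumptions.

```python
import random


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one layer (sizes are hypothetical)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def dense(x, layer):
    """One non-recurrent layer applied to a single frame."""
    W, b = layer
    return relu([a + bi for a, bi in zip(matvec(W, x), b)])


def feed_forward_stack(frames, layers):
    """Non-recurrent layers: each frame is independent of every other frame,
    so the outer loop could run in parallel across time steps."""
    out = []
    for x in frames:
        for layer in layers:
            x = dense(x, layer)
        out.append(x)
    return out


def recurrent_layer(frames, in_layer, W_rec):
    """Recurrent layer: step t needs the hidden state from step t-1,
    so this loop is inherently sequential. It consumes learned features,
    not raw inputs, because it sits above the feed-forward stack."""
    W_in, b = in_layer
    h = [0.0] * len(b)
    out = []
    for x in frames:
        pre = [a + r + bi
               for a, r, bi in zip(matvec(W_in, x), matvec(W_rec, h), b)]
        h = relu(pre)
        out.append(h)
    return out


def build_model(seed=0, n_in=4, width=6):
    """Assumed layout: 3 feed-forward, 1 recurrent, 1 feed-forward."""
    rng = random.Random(seed)
    ff = [init_layer(rng, n_in, width),
          init_layer(rng, width, width),
          init_layer(rng, width, width)]
    rec_in = init_layer(rng, width, width)
    W_rec, _ = init_layer(rng, width, width)
    top = init_layer(rng, width, width)
    return ff, rec_in, W_rec, top


def forward(frames, model):
    ff, rec_in, W_rec, top = model
    feats = feed_forward_stack(frames, ff)        # parallel over time
    ctx = recurrent_layer(feats, rec_in, W_rec)   # sequential over time
    return [dense(h, top) for h in ctx]           # parallel over time


if __name__ == "__main__":
    model = build_model()
    frames = [[0.1 * (t + i) for i in range(4)] for t in range(5)]
    out = forward(frames, model)
    print(len(out), len(out[0]))
```

Only `recurrent_layer` carries state from one time step to the next; everything else maps over frames independently, which is why keeping the recurrence to a single layer limits the sequential part of the computation.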

Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.