Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?
As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.
- The recurrent layer should go "higher" in the network, because it propagates long-range context more effectively when it operates on the network's learned feature representation than when it operates on raw input values.
Other decisions are guided by a mix of trial and error and intuition. We started on much smaller datasets, which give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
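To make the two constraints in the list above concrete, here is a minimal NumPy sketch of a 5-hidden-layer stack whose first 3 layers are non-recurrent. It is only an illustration, not the actual model code: putting the recurrent layer at position 4 with one more non-recurrent layer after it, the plain ReLU activation, the forward-only recurrence, and all names and sizes (`init_params`, `forward`, `n_hidden`) are assumptions made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_params(rng, n_in, n_hidden, n_out):
    """Hypothetical layout: 3 feed-forward layers, 1 recurrent, 1 feed-forward, output."""
    def dense(a, b):
        return rng.normal(0.0, 0.1, (a, b)), np.zeros(b)
    return {
        "ff": [dense(n_in, n_hidden), dense(n_hidden, n_hidden), dense(n_hidden, n_hidden)],
        "rec_in": dense(n_hidden, n_hidden),               # input-to-hidden weights of the recurrent layer
        "rec_h": rng.normal(0.0, 0.1, (n_hidden, n_hidden)),  # hidden-to-hidden weights
        "top": dense(n_hidden, n_hidden),
        "out": dense(n_hidden, n_out),
    }

def forward(params, x):
    """x has shape (T, n_in); returns per-frame outputs of shape (T, n_out)."""
    h = x
    # Layers 1-3: no dependence between time steps, so all T frames
    # are processed together with one matrix multiply per layer.
    for W, b in params["ff"]:
        h = relu(h @ W + b)

    # Layer 4: the single recurrent layer. It sees learned features, not raw input.
    W, b = params["rec_in"]
    pre = h @ W + b                      # this projection is still parallel over time
    U = params["rec_h"]
    state = np.zeros(U.shape[0])
    states = []
    for t in range(pre.shape[0]):        # the sequential part: step t needs step t-1
        state = relu(pre[t] + state @ U)
        states.append(state)
    h = np.stack(states)

    # Layer 5 and the output layer: parallel over time again.
    W, b = params["top"]
    h = relu(h @ W + b)
    W, b = params["out"]
    return h @ W + b

rng = np.random.default_rng(0)
params = init_params(rng, n_in=8, n_hidden=16, n_out=4)
x = rng.normal(size=(5, 8))              # 5 frames of 8 made-up input features
print(forward(params, x).shape)
```

The only Python-level loop over time is in layer 4; every other layer is a batched matrix multiply over all frames, which is why adding a second recurrent layer would add a second sequential section.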