nets too small (not enough layers)
gradients not flowing (fixed by residual connections)
layer outputs not normalized (fixed by normalization layers)
training algorithms and procedures not optimal (fixed by Adam, learning-rate warm-up, etc.)
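
A rough sketch of how three of these fixes fit together, in plain Python with no framework. All names here (layer_norm, residual_block, warmup_lr, toy_sublayer) and the default values are assumptions made up for illustration, not taken from any particular library. The normalization has no learned scale or shift, the block uses the "pre-norm" ordering (one of several possible), and Adam itself is not implemented; only a linear warm-up schedule of the kind often paired with it is shown.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to roughly zero mean and unit variance.

    Simplified: no learned scale/shift parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Pre-norm residual block: output = x + sublayer(layer_norm(x)).

    The identity path (the bare `x` term) gives gradients a direct
    route through the layer, whatever the sublayer does.
    """
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warm-up from near zero to base_lr, then constant.

    base_lr and warmup_steps are illustrative defaults, not recommendations.
    """
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def toy_sublayer(h):
    """Hypothetical stand-in for a real sublayer (e.g. an MLP)."""
    return [0.1 * v for v in h]


# Stack several blocks: "more layers" is just repeated application.
x = [1.0, 2.0, 3.0, 4.0]
for _ in range(8):
    x = residual_block(x, toy_sublayer)
```

Because each block only adds a correction to its input, stacking more of them does not cut off the path back to the first layer, which is the point of the second item in the list.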