Hacker News new | ask | show | jobs
by cztomsik 394 days ago
hm, residual is what I would not expect, can you elaborate why?
1 comments

Avoids vanishing gradients in deeper networks.

Also, most blocks with a residual approximate the identity function when initialised, so tend to be well behaved.