by SuchAnonMuchWow 940 days ago
A skip connection extends the live range of one intermediate result across the whole skipped part of the network: the tensor at the start of the skip connection must stay in memory longer while unrelated computation happens. That increases pressure on the memory hierarchy (either the L2 cache or scratchpad memory).

This is especially true for inference with vision transformers, for example, where it lowers the batch size you can use before hitting the L2 capacity wall.
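
To put rough numbers on the live-range argument, here is a minimal sketch, not a model of any real accelerator. It assumes layers run one after another, only a layer's input and output are live while it runs, weights are ignored, and the residual add is fused into the last layer. The helper names, the tensor sizes (a ViT-style MLP block in fp16), and the 40 MiB cache capacity are all made up for illustration.

```python
def peak_live_bytes(sizes, skip=False):
    """Peak bytes live at once while running a block layer by layer (inference).

    sizes[0] is the block input; sizes[i] is the output of layer i.
    While layer i runs, its input and its output are both live.
    With skip=True the block input also stays live until the end of the block.
    """
    peak = 0
    for i in range(1, len(sizes)):
        live = sizes[i - 1] + sizes[i]
        if skip and i > 1:
            # The block input is no longer this layer's input,
            # but it must be kept around for the residual add.
            live += sizes[0]
        peak = max(peak, live)
    return peak


def max_batch(cache_bytes, per_sample_bytes):
    """Largest batch whose peak working set still fits in the cache."""
    return cache_bytes // per_sample_bytes


# Assumed per-sample sizes: 197 tokens x 768 channels in fp16,
# with a hidden layer 4x wider than the input.
x = 197 * 768 * 2
sizes = [x, 4 * x, x]
L2_BYTES = 40 * 1024 * 1024  # hypothetical cache capacity

plain = peak_live_bytes(sizes)
with_skip = peak_live_bytes(sizes, skip=True)
print("peak bytes per sample:", plain, "vs", with_skip, "with skip")
print("max batch:", max_batch(L2_BYTES, plain), "vs",
      max_batch(L2_BYTES, with_skip), "with skip")
```

Under these assumptions the skip adds one extra copy of the block input to the peak working set, which is what shrinks the batch size that fits.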

1 comment

Okay, I see that for inference. But for training it shouldn't matter, because I need to hold on to all my activations for the backward pass anyway, right? But yeah, fair point!
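
A tiny sketch of the training case, assuming plain backprop with no activation checkpointing or recomputation, and assuming every activation in the block is saved for the backward pass (real frameworks save only what each op needs). The function name and sizes are hypothetical.

```python
def training_saved_bytes(sizes, skip=False):
    """Bytes of activations held until the backward pass.

    sizes[0] is the block input; sizes[i] is the output of layer i.
    Assumption: every activation is saved for backward.
    """
    saved = set(range(len(sizes)))  # indices of tensors kept for backward
    if skip:
        # The skip connection needs the block input, which is already saved.
        saved.add(0)
    return sum(sizes[i] for i in saved)


# Assumed per-sample sizes: input x, a 4x wider hidden layer, output x.
x = 197 * 768 * 2
sizes = [x, 4 * x, x]

print(training_saved_bytes(sizes), training_saved_bytes(sizes, skip=True))
```

Both calls return the same total, because under this assumption the tensor the skip needs is one that training keeps alive anyway.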