by SuchAnonMuchWow
940 days ago
Skip connections increase the live range of one intermediate result across the whole part of the network that is skipped:
the tensor at the beginning of a skip connection must be kept in memory longer while unrelated computation happens, which increases the pressure on the memory hierarchy (either the L2 cache or scratchpad memory). This is especially true, for example, for vision transformer inference, where it decreases the batch size you can use before hitting the L2 capacity wall.
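
To make the live-range argument concrete, here is a rough sketch in Python. It models a linear chain of tensors and counts how many bytes must be resident at each step, with and without a skip connection. The function names (`peak_live_bytes`, `max_batch`), the tensor sizes, and the capacity figure are all hypothetical, chosen for illustration only; real accelerators also have in-place ops, fusion, and tiling that this toy model ignores.

```python
def peak_live_bytes(sizes, skip=None):
    """Peak resident bytes for a chain t0 -> t1 -> ... -> tn (toy model).

    sizes[i] is the per-sample size of tensor i.
    skip=(src, dst) means tensor src is also read when producing tensor dst,
    as in a residual connection.
    """
    n = len(sizes)
    # In a plain chain, tensor i is last read when producing tensor i + 1.
    last_use = [min(i + 1, n - 1) for i in range(n)]
    if skip is not None:
        src, dst = skip
        # The skip connection extends the live range of the source tensor.
        last_use[src] = max(last_use[src], dst)

    peak = 0
    for step in range(1, n):
        # Live set while producing tensor `step`: its output buffer plus
        # every earlier tensor that is still needed at or after this step.
        live = sizes[step] + sum(
            sizes[i] for i in range(step) if last_use[i] >= step
        )
        peak = max(peak, live)
    return peak


def max_batch(capacity, per_sample_peak):
    """Largest batch whose peak live set still fits in `capacity` bytes."""
    return capacity // per_sample_peak


sizes = [4, 4, 4, 4]  # hypothetical per-sample tensor sizes
plain = peak_live_bytes(sizes)                 # 8
skipped = peak_live_bytes(sizes, skip=(0, 3))  # 12
print(plain, skipped)
print(max_batch(48, plain), max_batch(48, skipped))  # 6 vs 4
```

In this toy case the skip connection keeps `t0` alive across the two intermediate steps, so the peak live set grows from 8 to 12 units per sample, and a hypothetical 48-unit cache fits a batch of 4 instead of 6.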