by liuliu 1919 days ago
It is mostly applicable to transformer models; the ideas in the paper would be alien if you work on computer vision.

In transformer models, a big chunk of memory goes to the parameters and the optimizer states (because vanilla SGD is not used there). So a memory optimization technique that removes the duplication of parameters on each GPU, or offloads them entirely to the CPU, makes sense.
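
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam, which is my assumption and not something stated above: fp16 weights and gradients (2 bytes each per parameter) plus fp32 master weights, momentum and variance (12 bytes per parameter). The function name and the `partitioned` flag are made up for illustration, and activations are ignored.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, partitioned=False):
    """Rough per-GPU memory for model states only (activations ignored).

    Assumed byte counts (mixed-precision Adam, an assumption of this sketch):
      fp16 weights            2 bytes / parameter
      fp16 gradients          2 bytes / parameter
      fp32 optimizer states  12 bytes / parameter (master copy, momentum, variance)
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optimizer_states = 12 * num_params
    total = weights + grads + optimizer_states
    if partitioned:
        # Each GPU keeps only its 1/num_gpus slice instead of a full replica.
        return total // num_gpus
    # Plain data parallelism: every GPU holds a full copy of everything.
    return total


replicated = model_state_bytes_per_gpu(1_000_000_000, num_gpus=8)
sharded = model_state_bytes_per_gpu(1_000_000_000, num_gpus=8, partitioned=True)
```

Under these assumptions a 1B-parameter model needs about 16 GB of model state per GPU when replicated and about 2 GB per GPU when split eight ways, before a single activation is stored.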

In computer vision, a big chunk of memory is held by forward-layer activations, and the memory optimization technique applicable in those cases would be binomial checkpointing.
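
For a feel of what checkpointing trades, here is a toy sketch in plain Python. Note that it uses evenly spaced checkpoints, which is a simpler variant and not the binomial schedule itself (that one chooses checkpoint positions more cleverly for a given memory budget). Layers are hypothetical `(f, df)` pairs of scalar functions, where `df` is the derivative of `f` evaluated at the layer input; all names are made up for illustration.

```python
import math


def forward_with_checkpoints(layers, x, segment):
    """Run the chain, keeping only the input of every `segment`-th layer."""
    checkpoints = {}
    for i, (f, _) in enumerate(layers):
        if i % segment == 0:
            checkpoints[i] = x  # stored; all other activations are dropped
        x = f(x)
    return x, checkpoints


def backward_with_recompute(layers, checkpoints, segment, grad_out=1.0):
    """Backpropagate segment by segment, recomputing dropped activations."""
    grad = grad_out
    peak_recomputed = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(layers))
        # Recompute the inputs of layers start..end-1 from the checkpoint.
        inputs = [checkpoints[start]]
        for f, _ in layers[start:end - 1]:
            inputs.append(f(inputs[-1]))
        peak_recomputed = max(peak_recomputed, len(inputs))
        # Chain rule through this segment, last layer first.
        for (_, df), a in zip(reversed(layers[start:end]), reversed(inputs)):
            grad *= df(a)
    return grad, peak_recomputed


# A 16-layer toy chain alternating sin(x) and 2*x.
layers = [(math.sin, math.cos), (lambda v: 2 * v, lambda v: 2.0)] * 8
out, ckpts = forward_with_checkpoints(layers, 0.3, segment=4)
grad, peak = backward_with_recompute(layers, ckpts, segment=4)
```

With 16 layers and a segment of 4, this keeps 4 checkpoints and at most 4 recomputed activations alive at once, instead of all 16, at the cost of roughly one extra forward pass.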