Hacker News new | ask | show | jobs
by ioedward 1115 days ago
You also need the optimizer (e.g. Adam)'s state, which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.
1 comments

Yes, if you use ADAM - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSPRop, Adagrad have half of this overhead.

SGD has no optimizer overhead of course.