|
|
|
|
|
by sangnoir
806 days ago
|
|
> Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that. Which is probably why Meta is also buying the biggest Nvidia datacenter cards by the shipload. There is no need to run inference for a small model - say for a text-ad recommendation system - on an H100 with attendant electricity and cooling costs. |
|
E.g. it's common to have a full-width accumulator and e.g. s16 gradients with u8 activations and s8 weights, with the FMA (MAC) chain of the tensor multiply operation post-scaled with a learned u32 factor plus follow-up "learned" notify, which effectively acts as a fixed-point factor with learned position of it's point, to re-scale the outcome to the u8 activation range.
By having the gradients by sufficiently wider, it's practical to use a straight-through estimator for backpropagation. I read a paper (kinda two, actually) a few months ago that dealt with this (IIRC one of them was more about the hardware/ASIC aspects of fixed-point tensor cores, the other more about model training experiments with existing low precision integer-MAC chips IIRC particularly for interference in mind). If requested, I can probably find it by digging through my system(s); I would have already linked it/them if the cursory search hadn't failed.