Hacker News new | ask | show | jobs
by ivalm 2493 days ago
IIRC they use batch size = 1 and each core only know about one layer. Which is to say this thing has to be trained very differently from normal SGD (but requires very little memory). There is also the issue that they rely on sparseness, which you get with relu activations, but if, for example, language models move to gelu activations they will be somewhat screwed.