by radq 1060 days ago
Do outlier features emerge in sub-100M parameter models? I haven't seen any research discuss it below the 124M scale (bert-base). At that scale training a model takes ~4 days on an 8xA100 node.
1 comment

That is a fair question, and on top of that I'm not sure a simple metric like perplexity would pick it up.

However, I do think that if perplexity degraded less under quantization with this modified softmax than with the standard one, that would be an exciting finding, and enough to justify further experiments.
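
To make that comparison concrete, here is a rough sketch of the measurement I have in mind: compute perplexity from per-token log-probabilities in full precision and again after quantization, then compare the relative increase for each softmax variant. The function names and all the numbers are hypothetical placeholders, not results from any model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def quantization_dropoff(fp_log_probs, quant_log_probs):
    """Relative perplexity increase going from full precision to quantized."""
    fp = perplexity(fp_log_probs)
    quant = perplexity(quant_log_probs)
    return (quant - fp) / fp

# Made-up per-token log-probs, purely illustrative (assumption, not real data).
baseline_fp = [-2.0, -2.0, -2.0]     # standard softmax, full precision
baseline_int8 = [-2.5, -2.5, -2.5]   # standard softmax, quantized
modified_fp = [-2.0, -2.0, -2.0]     # modified softmax, full precision
modified_int8 = [-2.1, -2.1, -2.1]   # modified softmax, quantized

baseline_drop = quantization_dropoff(baseline_fp, baseline_int8)
modified_drop = quantization_dropoff(modified_fp, modified_int8)

# The interesting outcome would be modified_drop < baseline_drop.
print(f"baseline drop-off: {baseline_drop:.3f}")
print(f"modified drop-off: {modified_drop:.3f}")
```

In a real run the log-probs would come from evaluating each trained model on a held-out set, before and after quantizing it.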

But you are right - if it doesn't show an improvement it doesn't necessarily rule out that it could be helping.

Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT-base uncased (109M params) and OPT 125M, and are able to show the effects using perplexity.

I hadn't read the paper when I suggested the same approach, so I guess that is good validation that it is worth trying.

Edit2: Actually, they also test on ViT 22M, which I think would be even quicker to try.