by kromem 690 days ago
In general this needs to be done across the board.

The perplexity per parameter is higher and the delta grows as it scales.

Not per bit, but per parameter.

Why this is happening needs far more attention, and far more weight in pretrained model development right now.

It's a sleeping giant of a difference in a space where even marginal gains make headlines.