by kromem 737 days ago

In general this needs to be done across the board.

The perplexity per parameter is higher, and the delta grows as the model scales.

Not per bit, but per parameter.

Why this is happening deserves far more attention, and it should be factored into pretrained model development right now.

It's a sleeping giant of a difference in a space where even marginal gains make headlines.