| HN Mirror



	by karalala 815 days ago
	True, but they normally aren't this far off. HGRN claims that it outperforms a transformer for a 1B-parameter model trained on the Pile. HGRN performing 8 ppl worse suggests that it's useless.

1 comments

AIsore 814 days ago

In my experience, many are far off, and most of the time published tables from different papers are hard to compare. If you are asserting that these results are flawed, I would like to see more substance (code, reproduction, ...). And for balance: for the same reason, it is hard to verify the accuracy of these results without further insight.


logicchains 814 days ago

So many papers play tricks with the learning rate schedule: https://arxiv.org/abs/2307.06440
