| HN Mirror


by sillysaurusx 1161 days ago

Wow. Does doubling the vocab size actually help?

It certainly makes training more expensive. One clever trick to get some memory savings is to freeze the vocab embedding layer when fine-tuning. It makes a noticeable improvement, both in speed and in memory required.
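To put a rough number on that trick, here is a back-of-the-envelope sketch. It assumes fp32 gradients and an Adam-style optimizer that keeps two fp32 moment estimates per parameter; the function name, the hidden size of 4096, and the two vocab sizes are made up for illustration, not taken from any particular model.

```python
# Rough estimate of training memory saved by freezing the token embedding table.
# All sizes below are illustrative assumptions, not measurements.

def embedding_training_overhead_bytes(vocab_size, hidden_dim,
                                      grad_bytes=4, optimizer_state_bytes=8):
    """Bytes for gradients plus optimizer state of a trainable embedding table.

    Assumes fp32 gradients (4 bytes per parameter) and an Adam-style optimizer
    holding two fp32 moment estimates (8 bytes per parameter). A frozen table
    needs neither, so this is roughly what freezing would save.
    """
    n_params = vocab_size * hidden_dim
    return n_params * (grad_bytes + optimizer_state_bytes)

HIDDEN_DIM = 4096  # hypothetical model width

for vocab_size in (30_000, 60_000):
    saved = embedding_training_overhead_bytes(vocab_size, HIDDEN_DIM)
    print(f"vocab={vocab_size}: ~{saved / 2**30:.2f} GiB saved by freezing")
```

The saving scales linearly with vocab size, so it matters more for the larger vocab. In PyTorch the freeze itself is usually just setting `requires_grad = False` on the embedding weight before building the optimizer.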

Surprised they went the larger vocab route. LLaMA's vocab is only about 32k. I wonder what the reason is...

Thanks!

1 comment

minimaxir 1161 days ago

A larger vocab takes longer to train but has no (practical) impact at inference time, since an embedding lookup is just a key-value store. That is very helpful as GPT starts hitting scaling laws.
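A toy sketch of the key-value point, covering only the input lookup side. The table layout, the helper names `make_table` and `embed`, and the tiny sizes are all hypothetical; a real model would use a tensor, but the indexing idea is the same.

```python
# Toy embedding table: token id -> vector, stored as a list of rows.
# Sizes are hypothetical and tiny so the example runs instantly.

def make_table(vocab_size, dim):
    # Row i is the (made-up) embedding of token id i.
    return [[float(i)] * dim for i in range(vocab_size)]

def embed(table, token_ids):
    # Plain indexing: one row fetched per token, no scan over the vocab.
    return [table[t] for t in token_ids]

small = make_table(1_000, 4)
large = make_table(2_000, 4)  # doubled vocab
tokens = [3, 17, 999]

# The work done per lookup depends on the number of tokens, not the table size.
print(len(embed(small, tokens)), len(embed(large, tokens)))
```

Doubling the table adds rows that sit in memory, but each lookup still touches exactly one row per token.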
