	by wolfium3 1167 days ago
	You can use their online tool to see how it tokenizes words: https://platform.openai.com/tokenizer

minimaxir 1167 days ago

It's worth noting that this is only for GPT-3. If you're using ChatGPT or GPT-4, both use a different tokenizer that's more robust and uses/generates about 10% fewer tokens. (It's unclear how well it performs for non-English languages.)

You can test it offline using tiktoken: https://github.com/openai/tiktoken
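For anyone who wants to check the difference locally, here is a minimal sketch. It assumes `pip install tiktoken` and that the "gpt2" encoding stands in for GPT-3 while "cl100k_base" is the one used by ChatGPT/GPT-4. The helper names `token_counts` and `vocab_sizes` are made up for this example, and the first call needs network access to fetch the vocab files.

```python
def token_counts(text, names=("gpt2", "cl100k_base")):
    """Return {encoding name: number of tokens} for `text`."""
    import tiktoken  # lazy import: only needed when the helper is called

    return {name: len(tiktoken.get_encoding(name).encode(text)) for name in names}


def vocab_sizes(names=("gpt2", "cl100k_base")):
    """Return {encoding name: vocabulary size as reported by tiktoken}."""
    import tiktoken

    return {name: tiktoken.get_encoding(name).n_vocab for name in names}
```

Calling `token_counts(open("some_file.txt").read())` on your own text is the easiest way to see whether the ~10% figure holds for your data; the ratio will vary with language and content.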

dchest 1167 days ago

Here's an online version: https://tiktokenizer.vercel.app/

sillysaurusx 1167 days ago

10% smaller vocab size, or 10% fewer tokens on average? I assume the latter, but total vocab size is also an interesting metric.

The tokenization speedups in that repo are very impressive. Tokenization was the most annoying part of processing 190,000 books. I think it took a few days on a server with 96 cores.

Surprisingly hard to figure out the vocab size from that repo.

minimaxir 1167 days ago

10% fewer tokens on average.

The vocab size itself is doubled. (~50k for GPT-2/3, ~100k for ChatGPT)
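To put the doubling in perspective, here is a rough back-of-the-envelope sketch of how the input embedding table grows. The vocab sizes are the ones tiktoken reports for "gpt2" (50,257) and "cl100k_base" (100,277). The width `D_MODEL = 4096` is purely a hypothetical value for illustration, not a figure for any OpenAI model.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in an embedding table with one d_model-wide row per token."""
    return vocab_size * d_model


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

gpt2_params = embedding_params(50_257, D_MODEL)
cl100k_params = embedding_params(100_277, D_MODEL)

print(f"gpt2:        {gpt2_params:,} embedding weights")
print(f"cl100k_base: {cl100k_params:,} embedding weights")
print(f"ratio:       {cl100k_params / gpt2_params:.3f}")
```

The table grows linearly with the vocab, so doubling the vocab roughly doubles this one layer and leaves the rest of the network unchanged.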

sillysaurusx 1167 days ago

Wow. Does doubling the vocab size help?

It certainly makes training more expensive. One clever trick to get some memory savings is to freeze the vocab embedding layer when fine-tuning. It makes a noticeable improvement, both in speed and in memory required.
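A minimal sketch of that freezing trick, assuming a PyTorch model whose token embeddings are `nn.Embedding` modules. The helper name `freeze_embeddings` is made up for this example. If the model ties its output head to the embedding weight, this would freeze the head too, so check that first.

```python
def freeze_embeddings(model):
    """Turn off gradients for every nn.Embedding in `model`.

    Returns the parameters that are still trainable, so they can be handed
    to the optimizer without allocating optimizer state for the frozen table.
    """
    import torch.nn as nn  # lazy import: assumes PyTorch is installed

    for module in model.modules():
        if isinstance(module, nn.Embedding):
            module.weight.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]
```

Usage would look something like `optimizer = torch.optim.AdamW(freeze_embeddings(model), lr=1e-5)`. The savings come from skipping the gradient and the optimizer's per-parameter state for the embedding table.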

Surprised they went the larger vocab route. LLaMA is only 32k. I wonder what the reason is...

Thanks!

minimaxir 1167 days ago

A larger vocab takes longer to train but has no (practical) impact at inference time, since an embedding lookup is just a key-value store. That's very helpful as GPT starts hitting scaling laws.
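A toy sketch of the key-value point, in plain Python with made-up sizes: looking up a token's vector is a row index, so the work depends on how many tokens are in the prompt, not on how many rows the table has. It only illustrates the input lookup side; `make_table` and `embed` are hypothetical helpers.

```python
def make_table(vocab_size, d_model):
    """Build a toy embedding table: one row of d_model floats per token id."""
    return [[float(token_id)] * d_model for token_id in range(vocab_size)]


def embed(table, token_ids):
    """Fetch one row per token id; work scales with len(token_ids), not len(table)."""
    return [table[token_id] for token_id in token_ids]


small = make_table(50, 4)    # stand-in for a ~50k vocab
large = make_table(100, 4)   # stand-in for a ~100k vocab
prompt = [3, 17, 42]

# Same three index operations either way; the extra rows are never touched.
print(embed(small, prompt) == embed(large, prompt))
```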
