


	by dentalperson 1218 days ago
	How does token->embedding lookup work with 1.4T BPE tokens? Since there are more tokens than the 65B parameters, it must be doing some sort of interesting thing based on the merge operations. Is it different from what other GPT models with ~100k tokens are doing? At inference, how many of those tokens are used? (They mention most tokens are used only once during training, so they must be very long sequences.)

1 comment

bitRAKE 1218 days ago

The 1.4T tokens are what the model was trained on, and not the token range of the embedding.
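
To make the distinction concrete, here is a rough sketch in plain Python. All sizes (VOCAB_SIZE, EMBED_DIM, CORPUS_TOKENS) are made-up toy values, not the real model's, and the table is filled with random numbers rather than learned weights. It only shows that the embedding table has one row per vocabulary entry, so its size doesn't change however long the training stream is.

```python
import random

# Hypothetical toy sizes, chosen only for illustration (not the real model's).
VOCAB_SIZE = 8        # distinct token IDs = rows in the embedding table
EMBED_DIM = 4         # width of each embedding vector
CORPUS_TOKENS = 1000  # length of the training stream; could be far larger

rng = random.Random(0)

# One row per vocabulary entry, not one row per token in the corpus.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

# The "corpus" is a long sequence of IDs drawn from the fixed vocabulary.
corpus = [rng.randrange(VOCAB_SIZE) for _ in range(CORPUS_TOKENS)]


def embed(token_ids):
    """Look up one table row per token ID."""
    return [embedding_table[t] for t in token_ids]


vectors = embed(corpus)

# The parameter count of the table depends only on vocabulary size and width.
embedding_params = VOCAB_SIZE * EMBED_DIM

print("tokens in corpus:", len(corpus))
print("rows in embedding table:", len(embedding_table))
print("embedding parameters:", embedding_params)
```

Making CORPUS_TOKENS larger changes how many lookups happen, but not how many rows or parameters the table has.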


dentalperson 1218 days ago

Ah, that makes more sense, thank you. Since this was mentioned in the tokenizer section and the number of unique tokens wasn't mentioned I misunderstood.
