by dentalperson
1218 days ago
How does token->embedding lookup work with 1.4T BPE tokens? Since there are more tokens than the model's 65B parameters, it must be doing something interesting based on the merge operations. Is it different from what other GPT models with ~100k tokens are doing? At inference, how many of those tokens are used? (They mention most tokens are used only once during training, so they must be very long sequences.)
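For context on the question, here is a minimal sketch of how token->embedding lookup usually works in GPT-style models. It rests on an assumption the comment does not state: that the 1.4T figure counts tokens in the training corpus, while the vocabulary (the set of distinct token IDs) is a separate and much smaller number. The names `VOCAB_SIZE`, `EMBED_DIM`, `embedding_table`, and `embed`, and all the sizes, are hypothetical toy values chosen for illustration.

```python
import random

VOCAB_SIZE = 8  # hypothetical toy vocabulary size (number of distinct token IDs)
EMBED_DIM = 4   # hypothetical toy embedding width

random.seed(0)

# One row per vocabulary entry. The table holds VOCAB_SIZE * EMBED_DIM
# parameters no matter how many tokens the training corpus contains.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Return one embedding row per token ID (plain table indexing)."""
    return [embedding_table[t] for t in token_ids]


# A toy "corpus" of 12 tokens that reuses the same 8 token IDs.
corpus = [3, 1, 3, 7, 0, 1, 1, 5, 3, 2, 6, 4]
vectors = embed(corpus)

num_params = VOCAB_SIZE * EMBED_DIM
print(len(corpus), len(set(corpus)), num_params)  # prints: 12 8 32
```

Under that assumption, the corpus can grow without the embedding table growing, since repeated token IDs index the same row.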