|
|
|
|
|
by sva_
698 days ago
|
|
Does it make sense to measure a dataset in tokens? Shouldn't it be tokenizer-agnostic? I.e. the OpenAI tokenizer encodes about ~4 characters per token, but I could also have a tokenizer that does 1 character per token leading to a ~4x increase in token count (relative to the OpenAI tokenizer.) |
|