by sva_ 698 days ago
Does it make sense to measure a dataset in tokens? Shouldn't the measure be tokenizer-agnostic? E.g., the OpenAI tokenizer encodes about 4 characters per token, but I could also have a tokenizer that does 1 character per token, leading to a ~4x increase in token count (relative to the OpenAI tokenizer).
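To make the ~4x arithmetic concrete, here is a toy sketch in Python. Both tokenizers are hypothetical stand-ins written for illustration: `char_tokenize` emits one token per character, and `chunk_tokenize` cuts text into fixed 4-character pieces as a crude imitation of "about 4 characters per token". Neither is the real OpenAI or GPT-2 tokenizer, and the sample documents are made up.

```python
from typing import Callable, List


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per character."""
    return list(text)


def chunk_tokenize(text: str, chunk_size: int = 4) -> List[str]:
    """Toy tokenizer: fixed-size character chunks (assumed ~4 chars per token)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def dataset_size(docs: List[str], tokenize: Callable[[str], List[str]]) -> int:
    """Total token count of a dataset under a given tokenizer."""
    return sum(len(tokenize(doc)) for doc in docs)


# Made-up sample "dataset" for illustration only.
docs = [
    "Does it make sense to measure a dataset in tokens?",
    "Shouldn't it be tokenizer-agnostic?",
]

total_chars = sum(len(doc) for doc in docs)   # tokenizer-agnostic size
char_tokens = dataset_size(docs, char_tokenize)
chunk_tokens = dataset_size(docs, chunk_tokenize)

print(f"characters:            {total_chars}")
print(f"char-level tokens:     {char_tokens}")
print(f"4-char chunk tokens:   {chunk_tokens}")
print(f"ratio (char / chunk):  {char_tokens / chunk_tokens:.2f}")
```

The same text yields roughly four times as many tokens under the character-level tokenizer, while the character count stays fixed, which is why a token count is only comparable across datasets when the tokenizer is stated.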

Hello! Totally agree that token counts are tokenizer-dependent. We chose to count tokens with the GPT-2 tokenizer because GPT-2 token counts are a common metric reported by other datasets like FineWeb. So this should give you a rough sense of how large the data is in comparison to others. We also report other metrics, such as the number of documents and the number of images.
How does the GPT-2 tokenizer deal with non-text input? This dataset is multimodal but I thought GPT-2 was text only.