	by sva_ 698 days ago
	Does it make sense to measure a dataset in tokens? Shouldn't the measure be tokenizer-agnostic? For example, the OpenAI tokenizer encodes about 4 characters per token, but I could also have a tokenizer that encodes 1 character per token, leading to a ~4x increase in token count relative to the OpenAI tokenizer.
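
A minimal sketch of the arithmetic in this question, assuming two toy tokenizers that are purely illustrative: `char_tokens` emits one token per character, and `chunk_tokens` is a hypothetical stand-in for a subword tokenizer that averages about 4 characters per token. Neither is the OpenAI or GPT-2 tokenizer; the point is only that the token count changes with the tokenizer while the character count does not.

```python
import math


def char_tokens(text: str) -> list[str]:
    """Hypothetical tokenizer: one token per character."""
    return list(text)


def chunk_tokens(text: str, width: int = 4) -> list[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits the text into fixed-size chunks of `width` characters, which
    mimics an average of about `width` characters per token.
    """
    return [text[i:i + width] for i in range(0, len(text), width)]


# Illustrative sample text, not taken from any real dataset.
sample = "Tokenizer-agnostic size metrics stay fixed. " * 100

n_chars = len(sample)                       # tokenizer-agnostic size
n_char_tokens = len(char_tokens(sample))    # equals the character count
n_chunk_tokens = len(chunk_tokens(sample))  # roughly a quarter of that
ratio = n_char_tokens / n_chunk_tokens

print(f"characters:         {n_chars}")
print(f"1-char tokens:      {n_char_tokens}")
print(f"~4-char tokens:     {n_chunk_tokens}")
print(f"token count ratio:  {ratio:.2f}x")
```

Token counts from different tokenizers are therefore only comparable after fixing the tokenizer, whereas characters, bytes, or document counts can be compared directly.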

anas-awadalla 698 days ago

Hello! Totally agree that token counts are model dependent. We chose to count tokens with the GPT-2 tokenizer because that is a common convention used by other datasets such as FineWeb, so the figure should give you a rough sense of how large the data is in comparison to others. We also report other metrics, such as the number of documents and the number of images.

reverius42 697 days ago

How does the GPT-2 tokenizer deal with non-text input? This dataset is multimodal, but I thought GPT-2 was text-only.