| HN Mirror



	by anas-awadalla 700 days ago
	Hello! Totally agree that token counts are model dependent. We chose to count tokens with the GPT-2 tokenizer because that is a common convention used by other datasets such as FineWeb, so the number should give you a rough sense of how large the data is in comparison to others. We also report other metrics, such as the number of documents and the number of images.
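
	A minimal sketch of what this kind of count could look like, assuming a hypothetical document layout where each document is an interleaved list in which a string is a text segment and None marks an image position. The names `dataset_stats` and `whitespace_encode` are made up for illustration, and the default whitespace tokenizer is only a stand-in so the snippet runs offline; it is not GPT-2 BPE and not the dataset's actual pipeline. The point is that the tokenizer only ever sees text, while images are tallied as a separate metric.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical layout: a document is an interleaved sequence where a string
# is a text segment and None marks the position of an image.
Doc = List[Optional[str]]


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer so the sketch runs offline. This is NOT GPT-2 BPE."""
    return text.split()


def dataset_stats(
    docs: Iterable[Doc],
    encode: Callable[[str], list] = whitespace_encode,
) -> dict:
    """Count documents, images, and text tokens; images never reach `encode`."""
    stats = {"documents": 0, "images": 0, "text_tokens": 0}
    for doc in docs:
        stats["documents"] += 1
        for segment in doc:
            if segment is None:
                # Image placeholder: counted on its own, not tokenized.
                stats["images"] += 1
            else:
                stats["text_tokens"] += len(encode(segment))
    return stats


docs = [
    ["A photo of a cat.", None, "It is asleep."],
    [None, None, "Two images, one caption."],
]
print(dataset_stats(docs))
```

	To get GPT-2 counts instead of whitespace counts, one option (assuming the `tiktoken` package is installed and can fetch its vocabulary) is to pass `tiktoken.get_encoding("gpt2").encode` as `encode`.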

1 comment

How does the GPT-2 tokenizer deal with non-text input? This dataset is multimodal but I thought GPT-2 was text only.