by gcr 317 days ago
The entire HP series is about one million words.

Harry Potter and the Order of the Phoenix alone is 400K tokens.
Curious, I found an epub, converted it to a txt, and dumped it into the Qwen3 tokenizer. It yielded 359,088 tokens, end to end.

Using the GPT-4 tokenizer (cl100k_base) yields 349,371 tokens.
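For anyone who wants to repeat the count, here is a minimal sketch of how it could be done, not necessarily how the numbers above were produced. It assumes the book is already a UTF-8 .txt file and that the third-party tiktoken package is installed; the function names and the file name are made up for illustration.

```python
from pathlib import Path
from typing import Callable, Sequence


def count_tokens(path: str, encode: Callable[[str], Sequence]) -> int:
    """Read a UTF-8 text file whole and return how many tokens `encode` yields."""
    text = Path(path).read_text(encoding="utf-8")
    return len(encode(text))


def cl100k_encoder() -> Callable[[str], Sequence]:
    """Return an encode function for cl100k_base (assumes `pip install tiktoken`)."""
    import tiktoken  # imported lazily so count_tokens itself has no dependency

    enc = tiktoken.get_encoding("cl100k_base")
    # encode_ordinary treats strings like "<|endoftext|>" as plain text
    # instead of raising on them.
    return enc.encode_ordinary


# Example with a hypothetical file name:
#   print(count_tokens("order_of_the_phoenix.txt", cl100k_encoder()))
```

Any other tokenizer can be swapped in by passing its encode function, for example the `encode` method of a Hugging Face tokenizer. Exact counts will shift a little with how the epub was converted (front matter, line breaks, and so on).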

Recent Google and Anthropic models don't ship local tokenizers and, ridiculously, make you call their APIs to count tokens, so I have no idea about those.

Just thought that was interesting.

And takes up a proportional width of everyone's bookshelves alongside the others.