


	by herpderperator 25 days ago
	The visualiser seems quite naive about what it counts as a token. I don't think a token is an entire word as often as the demo shows, and when it gets to the `def estimate_tokens` function, the entire `# Rough heuristic: ~1 token per 4 chars of English` comment is printed all at once as a single token, which is certainly not accurate. This is not a realistic replay of what a common LLM would actually emit - the token boundaries are entirely fabricated. But for the purpose of getting a feel for tokens per second, I suppose it's good enough.
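
To make the gap concrete, here is a minimal sketch that applies the demo's own stated heuristic (~1 token per 4 characters) to that comment line. The body of `estimate_tokens` is an assumption written for illustration, not the demo's actual code, and `count_words` is a hypothetical helper. Neither is a real tokenizer; the point is only that a single printed chunk undercounts by any of these measures.

```python
import math


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~1 token per 4 chars of English.
    # Assumed implementation for illustration; not the demo's real code.
    return math.ceil(len(text) / 4)


def count_words(text: str) -> int:
    # Hypothetical helper: whitespace-separated words, the "one token
    # per word" view the visualiser appears to lean towards.
    return len(text.split())


comment = "# Rough heuristic: ~1 token per 4 chars of English"

print("chunks shown by the demo:", 1)
print("whitespace words:", count_words(comment))
print("4-chars-per-token estimate:", estimate_tokens(comment))
```

A real tokenizer would give yet another count, but even these two crude measures put the comment well above one token.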

1 comment

davely 25 days ago

I built something similar a while back [1] and used OpenAI’s tokenizer playground [2] to recalculate the tokens for a giant block of lorem ipsum text. I feel like this gives a much more accurate representation.

[1] https://dave.ly/tools/tokenflow/

[2] https://platform.openai.com/tokenizer