Hacker News
by noddybear 9 days ago
Aren’t Unicode characters generally treated as 2 tokens to avoid a huge vocabulary?
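
For what it's worth, here is a rough way to sanity-check the "2 tokens" figure. It assumes a byte-level BPE tokenizer (GPT-2 style), where text is first encoded as UTF-8 and a character with no learned merges falls back to one token per byte. Under that assumption, the UTF-8 byte count is only an upper bound on tokens per character, not the actual count for any particular model. The helper name `utf8_byte_count` is just for illustration.

```python
def utf8_byte_count(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8.

    Assumption: under a byte-level BPE tokenizer, this is the worst-case
    number of tokens for the character (no merges learned for it).
    """
    return len(ch.encode("utf-8"))


# ASCII, Latin-1 supplement, a BMP symbol, and an emoji outside the BMP.
samples = ["a", "é", "€", "😀"]

for ch in samples:
    print(repr(ch), utf8_byte_count(ch))
# 'a' 1
# 'é' 2
# '€' 3
# '😀' 4
```

So the worst case ranges from 1 to 4 depending on the character, and frequently seen characters may still be merged into a single token by the learned vocabulary.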