Hacker News
by brizii 790 days ago
My group is currently working on a T5 model (and tokenizer) for HTML, as there are very few (if any) tokenizers that work well with it!

You can try GPT-4's tokenizer on your own HTML inputs at the link below [1] ... there's definitely room for improvement!

[1] https://tiktokenizer.vercel.app