by make3 1161 days ago
Using plain characters as tokens would make the sentences much longer, and so much more expensive to process.

That's the idea behind byte pair encoding (BPE) tokenizers: reduce the number of tokens in an average sentence to a short size, which cuts the computational cost. In this case, most of its training data is in English, so English sentences come out shorter (in number of tokens) than sentences in other languages.
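To make that concrete, here's a rough toy sketch of BPE in Python. It is not any real tokenizer's implementation: it works on characters rather than bytes, and the tiny training list and the function names (`learn_merges`, `apply_merge`, `encode`) are made up for illustration. It just shows that text resembling the training data collapses into fewer tokens than text that doesn't.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[tuple(apply_merge(list(symbols), best))] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical, English-only training data (an assumption for this demo).
training_words = ["the", "the", "the", "then", "there", "other"]
merges = learn_merges(training_words, num_merges=10)

print(encode("the", merges))  # ['the'], one token: seen often in training
print(encode("дом", merges))  # ['д', 'о', 'м'], one token per character
```

The word that looks like the training data ends up as a single token, while the Russian word never got any merges and stays at one token per character. Real tokenizers fall back to bytes rather than characters, so the gap for non-English text is often larger than this toy suggests.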