by rprenger 1492 days ago
I don't think a larger vocab would help. All the individual letters are in the ~50k token vocab already, but the word "alphabet" will still not get tokenized to [a, l, p, h, a, b, e, t]. Using a larger vocab like PaLM's 256k vocab would have the same issue.
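To make the point concrete, here is a rough toy sketch. It is not how GPT-style tokenizers actually work internally (they apply learned BPE merge rules); it uses a simple greedy longest-match instead, and both vocabularies are made up for illustration. The idea it shows is the same though: having every single letter in the vocab doesn't matter, because a longer entry wins whenever one exists, and growing the vocab only adds more long entries.

```python
import string


def greedy_tokenize(text, vocab):
    """Toy tokenizer: at each position, take the longest vocab entry that matches.

    This is a simplification for illustration, not real BPE.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i]!r}")
    return tokens


# Hypothetical vocabularies: every letter is present in all of them.
letters = set(string.ascii_lowercase)
small_vocab = letters | {"al", "ph", "abet"}
large_vocab = small_vocab | {"alpha", "bet", "alphabet"}

letters_only = greedy_tokenize("alphabet", letters)
small_tokens = greedy_tokenize("alphabet", small_vocab)
large_tokens = greedy_tokenize("alphabet", large_vocab)

print(letters_only)  # one token per letter, only because nothing longer exists
print(small_tokens)  # multi-letter pieces, even though all letters are in the vocab
print(large_tokens)  # the bigger vocab moves further away from letters, not closer
```

In this toy setup the only way to get [a, l, p, h, a, b, e, t] is to have no multi-letter entries at all, which is the opposite of making the vocab larger.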