Hacker News
by lupire 567 days ago
Training data is already provided by humans and certainly already does include spelling instruction, which the model is blind to because of forced tokenization. Tokenizing on words is already an arbitrary capability added one at a time. It's just the wrong one. LLMs should be tokenizing by letter, but they don't, because they aren't good enough yet, so they get a massive deus ex machina (human ex machina?) of wordish tokenization.
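
To make the "blind to spelling" point concrete, here's a toy sketch. Everything in it is an assumption for illustration: VOCAB is a made-up four-entry vocabulary and wordish_tokenize is a simple greedy longest-match splitter, not any real model's tokenizer. It only shows that wordish tokens hand the model opaque IDs, while letter tokens keep the spelling visible.

```python
# Toy illustration only: VOCAB is a made-up vocabulary, not a real tokenizer's.
VOCAB = {"straw": 0, "berry": 1, "spell": 2, "ing": 3}


def wordish_tokenize(text, vocab=VOCAB):
    """Greedy longest-match split into vocabulary pieces; returns token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids


def letter_tokenize(text):
    """One token per character, so the spelling is directly visible."""
    return [ord(c) for c in text]


word = "strawberry"
print(wordish_tokenize(word))  # two opaque IDs; the letters are not in the input
print(letter_tokenize(word))   # one ID per letter

# Counting a letter is trivial on letter tokens, but cannot be read off
# the wordish IDs without having memorized how each piece is spelled.
print(letter_tokenize(word).count(ord("r")))
```

With the wordish split, the model's input for "strawberry" is just two IDs, so anything about its letters has to be learned indirectly from text that talks about spelling. The letter-level version trades that for much longer sequences.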