Hacker News
by Imnimo 1165 days ago
Setting aside the specific choice of tokenizer for GPT models, I'm curious how much model performance depends on the features of the human language the training data is written in. Suppose you kept the exact same training corpus, could wave a magic wand to translate it into any language, and could build a custom tokenizer for each language. Would some languages be more amenable than others to GPT-style language modeling?