Can you LLM a custom language?
1 point
by campervans
840 days ago
If token limit and accuracy are important, it seems English (or any other spoken language) is not optimal. Spoken languages are a butchered product of history and easy verbal noises. A new custom language seems inevitable: one that is concise, unambiguous, and rooted in relations, with custom words. It would replace common phrases with short strings, for example "Once upon a time..." becomes "a1". The strings would most likely be alphanumeric, to minimise tokens and give an order-of-magnitude increase in effective context window, followed by translation back to {language}. Is this possible? Is anyone working on it? (Here to be educated.)
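To make the idea concrete, here is a minimal sketch of the substitution step and the translation back. Everything in it is hypothetical: the codebook entries, the `~` marker, and the function names `compress` and `expand` are made up for illustration, and it assumes `~` never appears in ordinary input.

```python
import re

# Hypothetical codebook: these phrase-to-code pairs are invented for illustration.
CODEBOOK = {"Once upon a time": "a1", "happily ever after": "a2"}
DECODE = {code: phrase for phrase, code in CODEBOOK.items()}

# Assumed escape character that never occurs in the input text.
# Without some marker, a literal "a1" in the text would be mis-decoded.
MARK = "~"

def compress(text):
    """Replace each known phrase with its marked short code."""
    for phrase, code in CODEBOOK.items():
        text = text.replace(phrase, MARK + code)
    return text

def expand(text):
    """Translate marked codes back to phrases; unknown codes are left as-is."""
    pattern = re.escape(MARK) + r"([a-z]\d+)"
    return re.sub(pattern, lambda m: DECODE.get(m.group(1), m.group(0)), text)

story = "Once upon a time, they lived happily ever after."
short = compress(story)
restored = expand(short)
```

This shortens the character count, but whether it shortens the token count depends entirely on how the model's tokenizer splits strings like "~a1".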
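This is essentially what byte-pair encoding (BPE) does: it repeatedly merges the most frequent adjacent pair of symbols into a single new token. It doesn't go quite so far as to allocate a single token to "Once upon a time", because that string isn't actually that common, but in principle it could.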
Trying to get humans to produce content directly in such a concise representation is a waste of time, since LLMs rely heavily on the content already available on the internet, which drastically reduces the labor cost of acquiring training data.
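For anyone who wants to see the byte-pair encoding point in action, here is a toy sketch. It is not how any production tokenizer is implemented: the names `learn_bpe`, `merge_pair`, and `encode` are made up for this example, it works on characters rather than bytes, and it skips the whitespace pre-splitting that real tokenizers usually apply. The corpus is deliberately artificial so that the phrase is common enough to collapse into one token.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy BPE)."""
    # Each string starts out as a tuple of single characters.
    words = Counter(tuple(text) for text in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every string is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to the whole corpus.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    symbols = tuple(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Artificial corpus in which the phrase is very common.
merges = learn_bpe(["once upon a time"] * 10, num_merges=20)
tokens = encode("once upon a time", merges)
```

With a corpus this repetitive the whole phrase ends up as one token. On real text the merge budget goes to far more frequent fragments first, which is the point made above.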