| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kherud 806 days ago
	Now that context length seems abundant for most tasks, I'm wondering why sub-word tokens are still used. I'm really curious how character-based LLMs would compare. With 2 M context, the compute bottleneck fades away. I'm not sure though what role the vocabulary size has. Maybe a large size is critical, since the embedding already contains a big chunk of the knowledge. On the other hand, using a character-based vocabulary would solve multiple problems, I think, like glitch tokens and possibly things like arithmetic and rhyming capabilities. Implementing sub-word tokenizers correctly and training them seems also quite complex. On a character level this should be trivial.

4 comments

AaronFriel 806 days ago

The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, to build up the embedding through attention would then require correlating far more tokens, each of which is "less meaningful".

I think we may get to this point eventually, in the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems like for text, too, we will eventually want that as well.

link

thomasahle 806 days ago

Maybe you could just use a good-old 1D-CNN for the bottom 3-4 layers. Then the model has been able to combine characters into roughly token length chunks anyway.

Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.

link

yk 806 days ago

> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) how inference looks at scale? (Kinda like ELI-using-single-gpus)

link

AaronFriel 805 days ago

The PagedAttention paper is a good starting point as it represents the first major open source inference engine that had "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180

link

darby_eight 806 days ago

> On a character level this should be trivial.

Characters are not the semantic components of words—these are syllables. Generally speaking, anyway. I've got to imagine this approach would yield higher quality results than the roman alphabet. I'm curious if this could be tested by just looking at how LLMs handle English vs Chinese.

link

inbetween 806 days ago

The minimal semantic parts of words are morphemes. Syllables are phonological units (roughly: the minimal unit for rhythmic purposes such as stress, etc)

link

darby_eight 806 days ago

Only in languages that have morphemes! This is hardly a universal attribute of language so much as an attribute of those that use an alphabet to encode sounds. It makes more sense to just bypass the encoding and directly consider the speech.

Besides, considering morphemes as semantic often results in a completely different meaning than we actually intend. We aren't trying to train a chatbot to speak in prefixes and suffixes, we're trying to train a chatbot to speak in natural language, even if it is encoded to latin script before output.

link

inbetween 805 days ago

That's technically wrong. Every language has morphemes for the simple reason that every word is at least one morpheme. `cat` is a morpheme. `cats` is two morphemes (cat-s).

(The point about semantics is also technically wrong. You would first need to specify your view of semantic compositionality before such a point can be evaluated, but the usual views of semantics don't have any such consequence.)

link

darby_eight 805 days ago

> Every language has morphemes for the simple reason that every word is at least one morpheme.

Sure, if you define "morpheme" as a collection of syllables that's meaningful to people using alphabetic script. I don't see any benefit to this compared to working with syllables directly, which is a meaningful concept regardless of the script used to encode them.

link

dragonwriter 804 days ago

> Sure, if you define “morpheme” as a collection of syllables

Cats, as noted, has two morphemes, despite having only one syllable. Syllables and morphemes are largely orthogonal, morphemes can be less than, equal to, or more than a syllable (and even when more than, may or may not start or end on a syllable boundary.)

(Also, syllables aren’t the minimal semantic units even of spoken speech, those are phonemes – a syllable consists of at least one phoneme, potentially more. But morphemes, even an alphabetic script if it isn’t perfectly phonetic, still don’t necessarily map to one or more phonemes, since is textual semantic unit may have no effect on pronunciation.)

link

inbetween 805 days ago

You might not see any benefit, but that's what those words mean :) Grab any textbook, it is linguistics 101!

link

joaogui1 806 days ago

I would say 2 big problems are:

1. latency, which would get worse if you have to sequentially generate more output

2. These models very roughly turn tokens -> "average meaning" on the embedding layer, followed by attention layers that combine the meanings, and feed forward layers that match the current meaning combination to some kind of learned archetype/prototype almost. When you move from word parts to characters all of that becomes more confusing (what's the average meaning of a?) and so I don't think there are good enough techniques to learn character-based models yet

link

novaRom 806 days ago

In AI music generation we have much better results with large vocabulary sizes of 10^6 order, my uneducated guess is that's because transformers are not universal pattern recognizers, they can catch patterns on a certain granularity level only.

link