| HN Mirror


	by o11c 606 days ago
	Because the way LLMs work is more or less "for every token, read the entire weight matrix from memory and do math on it". The math is fast and the memory reads are the slow part, so if you manage to store each item in the matrix with only half the bits, you only have to move half as much data. Of course, sometimes the model relied on those least-significant bits in the original training, so dropping them can cost some accuracy.
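	To put rough numbers on that, here is a minimal sketch that assumes decoding is limited purely by memory bandwidth, with every weight read once per token. The 7-billion-parameter model and the 100 GB/s bandwidth are hypothetical figures picked for illustration, and the function names are made up.

```python
def bytes_read_per_token(n_params: int, bits_per_weight: int) -> float:
    """Bytes of weight data streamed from memory to produce one token."""
    return n_params * bits_per_weight / 8


def tokens_per_second_upper_bound(n_params: int, bits_per_weight: int,
                                  bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_read_per_token(n_params, bits_per_weight)


# Hypothetical numbers: a 7-billion-parameter model and 100 GB/s of bandwidth.
N_PARAMS = 7_000_000_000
BANDWIDTH = 100e9

for bits in (16, 8, 4):
    gb = bytes_read_per_token(N_PARAMS, bits) / 1e9
    tps = tokens_per_second_upper_bound(N_PARAMS, bits, BANDWIDTH)
    print(f"{bits:2d}-bit weights: {gb:5.1f} GB per token, at most {tps:5.1f} tokens/s")
```

	Halving the bits per weight halves the bytes read per token, which doubles the bandwidth-bound ceiling. Real throughput also depends on compute, caches, and batching, which this sketch ignores.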

slimsag 606 days ago

Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

E.g., instead of the tokens ['I', 'am', 'beautiful'], having the tokens ['I am', 'beautiful'], on the premise that 'I am' is a common sequence of bytes that works as a single semantic token identifying a 'property of self'?

Or taking that further, having much larger tokens based on statistical analysis of common phrases of ~5 words or so?

pizza 606 days ago

I think you might be thinking of applying a kind of low-rank decomposition to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.
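As a rough illustration of what a low-rank decomposition of the vocabulary embeddings could save, the sketch below only counts parameters: one full table of shape V by d versus two factors of shape V by r and r by d. The function names and all sizes are hypothetical, and this is not taken from any particular paper or library.

```python
def full_embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a single dense embedding table (vocab_size x d_model)."""
    return vocab_size * d_model


def low_rank_embedding_params(vocab_size: int, d_model: int, rank: int) -> int:
    """Parameters when the table is factored as (vocab_size x rank) @ (rank x d_model)."""
    return vocab_size * rank + rank * d_model


# Hypothetical sizes: a large multilingual vocabulary and a small rank.
VOCAB, D_MODEL, RANK = 250_000, 4096, 256

full = full_embedding_params(VOCAB, D_MODEL)
factored = low_rank_embedding_params(VOCAB, D_MODEL, RANK)
print(f"full table:     {full:,} parameters")
print(f"rank-{RANK} factors: {factored:,} parameters")
```

The factored form is smaller whenever the rank is well below both the vocabulary size and the model width; whether quality holds up at a given rank is an empirical question this sketch does not answer.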

visarga 606 days ago

Yes, look up Byte Pair Encoding (BPE), which builds tokens by repeatedly merging the most frequent adjacent pair of symbols.

https://huggingface.co/learn/nlp-course/chapter6/5
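For a feel of how the merging works, here is a minimal character-level sketch of BPE training, not the Hugging Face implementation. The `cross_word` flag is a hypothetical addition: tokenizers commonly split on whitespace first, so merges stay inside words; treating the space as an ordinary symbol lets frequent multi-word chunks such as 'I am' become single tokens.

```python
from collections import Counter


def count_pairs(seqs: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by how often each sequence occurs."""
    pairs = Counter()
    for symbols, freq in seqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(seqs: dict[tuple[str, ...], int],
               pair: tuple[str, str]) -> dict[tuple[str, ...], int]:
    """Replace every occurrence of `pair` with one merged symbol (left to right)."""
    merged: dict[tuple[str, ...], int] = {}
    for symbols, freq in seqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus: list[str], num_merges: int,
              cross_word: bool = False) -> list[tuple[str, str]]:
    """Return the list of learned merges, most frequent pair first.

    cross_word=False: split on whitespace, so no token spans two words.
    cross_word=True:  keep spaces as symbols, so multi-word tokens can form.
    """
    if cross_word:
        units = list(corpus)
    else:
        units = [word for text in corpus for word in text.split()]
    seqs = dict(Counter(tuple(unit) for unit in units))

    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = merge_pair(seqs, best)
    return merges


corpus = ["i am happy", "i am here", "i am beautiful"]
print(learn_bpe(corpus, 6))
print(learn_bpe(corpus, 6, cross_word=True))
```

With `cross_word=True` the learned merges can include pieces that contain a space, which is roughly the "clusters of words" idea; with the default they cannot.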

dragonwriter 606 days ago

Much larger tokens require a much larger token vocabulary.
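A back-of-the-envelope sketch of why: the number of possible phrases grows exponentially with phrase length, and the embedding table grows linearly with however many of them you keep as tokens. All sizes below are hypothetical, and the helper names are made up for illustration.

```python
def phrase_combinations(base_vocab: int, words_per_token: int) -> int:
    """Upper bound on distinct phrases of a given length over a word vocabulary."""
    return base_vocab ** words_per_token


def embedding_table_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding (and the output layer too, if untied)."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


# Hypothetical sizes: 10,000 distinct words and a model width of 4096.
BASE_VOCAB, D_MODEL = 10_000, 4096

for n in (1, 2, 5):
    print(f"{n}-word phrases: up to {phrase_combinations(BASE_VOCAB, n):.1e} combinations")

for vocab in (32_000, 1_000_000):
    params = embedding_table_params(vocab, D_MODEL)
    print(f"vocab {vocab:>9,}: {params:,} embedding parameters")
```

Only a tiny fraction of those combinations occur in practice, but covering even the common ones pushes the vocabulary far beyond typical sizes, and rare tokens get few training examples each.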
