

by slimsag 607 days ago

Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

E.g., instead of the tokens ['I', 'am', 'beautiful'], use ['I am', 'beautiful'], on the premise that 'I am' is a common byte sequence that works as a single semantic token identifying a 'property of self'?

Or, taking that further, having much larger tokens based on statistical analysis of common phrases of ~5 words or so?
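
To make the idea concrete, here is a minimal sketch of what such a phrase-level tokenizer could look like: count word n-grams, promote the frequent ones to tokens, then tokenize by greedy longest match. The function names, the `min_count` threshold, and the whitespace word splitting are all illustrative assumptions, not an existing tokenizer's API.

```python
from collections import Counter

def frequent_phrases(words, max_len, min_count):
    """Collect word n-grams (2..max_len words) seen at least min_count times."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {phrase for phrase, c in counts.items() if c >= min_count}

def tokenize(words, phrases, max_len):
    """Greedy longest-match: prefer the longest known phrase at each position."""
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrases:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase matched here, so fall back to a single word.
            tokens.append(words[i])
            i += 1
    return tokens

words = "i am beautiful and i am tall".split()
phrases = frequent_phrases(words, max_len=5, min_count=2)
print(tokenize(words, phrases, max_len=5))
# ['i am', 'beautiful', 'and', 'i am', 'tall']
```

Note that this only groups by frequency; nothing in it checks that a phrase carries a specific semantic meaning.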

3 comments

pizza 607 days ago

I think you might be thinking of applying a kind of low-rank decomposition to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.


visarga 607 days ago

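Yes, look up Byte Pair Encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of tokens.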

https://huggingface.co/learn/nlp-course/chapter6/5
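
A rough sketch of the BPE training loop, for illustration only. It starts from single characters and repeatedly merges the most frequent adjacent pair. One assumption to flag: this toy version lets merges cross spaces, so multi-word tokens like 'I am' can emerge, whereas common implementations pre-split text into words first and never merge across word boundaries. The helper names are made up for this example.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if there is none."""
    counts = Counter(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_bpe(text, num_merges):
    """Learn up to num_merges merge rules from text, starting from characters."""
    seq = list(text)  # assumption: characters as base tokens, spaces included
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges.append(pair)
        seq = merge_pair(seq, pair)
    return merges, seq

merges, tokens = learn_bpe("I am here. I am there. I am fine.", num_merges=10)
print(tokens)
```

Each merge adds exactly one entry to the vocabulary, so the number of merges directly controls the vocabulary size.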


dragonwriter 607 days ago

Much larger tokens require a much larger token vocabulary.
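
A quick way to see why, sketched on a toy corpus (the sentence and the function name are made up for illustration): count how many distinct n-word phrases appear as n grows. With a base vocabulary of V words there are up to V**n possible n-word phrases, so the set of candidate tokens grows much faster than the set of single words.

```python
def distinct_ngrams(words, n):
    """Number of distinct n-word phrases that occur in the word list."""
    return len({tuple(words[i:i + n]) for i in range(len(words) - n + 1)})

words = "the cat sat on the mat the cat ran".split()
for n in (1, 2, 3):
    print(n, distinct_ngrams(words, n))

# Upper bound on two-word phrases for a hypothetical 50,000-word vocabulary.
print(50_000 ** 2)  # 2500000000
```

On real text most of those combinations never occur, but the count of phrases that do occur still far exceeds the count of distinct words.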
