| HN Mirror



	by andblac 223 days ago
	The "ALL CAPS" part of your comment got me thinking. I imagine most LLMs understand the subtle meanings of uppercase text depending on context. But, as I understand it, ALL CAPS text will tokenize differently from lowercase text. Is that right? In that case, won't uppercase text be harder for most models to understand and follow, since it's less common in datasets?

1 comments

minimaxir 223 days ago

There's more than enough ALL CAPS text in the corpus of the entire internet, and enough semantic context associated with it, for a model to read it as being in the imperative voice.


miohtama 223 days ago

Shouldn't all caps be normalised to the same tokens as lowercase? There are no separate tokens for all caps and lowercase in Llama, or at least there weren't in the past.


minimaxir 223 days ago

Looking at the tokenizer for the older Llama 2 model, the tokenizer has capital letters in it: https://huggingface.co/meta-llama/Llama-2-7b-hf
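
To make the point concrete, here is a toy sketch of what a case-sensitive vocabulary does to ALL CAPS text. Everything in it is an assumption for illustration: `TOY_VOCAB` is a made-up vocabulary, not Llama 2's, and `tokenize` uses greedy longest-match instead of the real BPE merge procedure. It only shows the general effect that uppercase words tend to split into more pieces when the vocabulary has fewer uppercase entries.

```python
import string

# Hypothetical vocabulary (NOT Llama 2's): a common lowercase word is a
# single token, while uppercase is covered only by single letters plus
# a couple of short uppercase fragments.
TOY_VOCAB = (
    {"important", "import", "ant", "IM", "NT"}
    | set(string.ascii_lowercase)
    | set(string.ascii_uppercase)
)


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation; a simplified stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


lower = tokenize("important")
upper = tokenize("IMPORTANT")
print(len(lower), lower)
print(len(upper), upper)
```

With this toy vocabulary the lowercase word is one token and the uppercase version is split into several, so the two spellings are distinct token sequences rather than being normalised to the same one.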
