| HN Mirror

	by rahimnathwani 884 days ago
	How and why would the tokenizer learn that a particular Unicode tag character was equivalent to a particular letter? I can't imagine there's a lot of text on the internet encoded this way.

1 comment

kevingadd 884 days ago

Maybe it saw them used in their intended way (for flags, etc.) and was able to make the association between the flags and their country codes, which then let it interpret them as individual letters?

It could also be from having been trained on Unicode character tables, which contain English descriptions of each code point.


rahimnathwani 884 days ago

BTW these are the character tables:

https://unicode.org/charts/PDF/UE0000.pdf
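
For anyone who wants to poke at this, here is a rough Python sketch of the mapping that chart describes. As far as I can tell, the tag characters U+E0020 through U+E007E mirror printable ASCII at a fixed offset of 0xE0000, and their official names embed the matching ASCII character name (for example "TAG LATIN CAPITAL LETTER A"). The helper names `to_tags` and `from_tags` are made up for illustration, and none of this says anything about what a tokenizer actually learned.

```python
import unicodedata

# Assumption for this sketch: U+E0020..U+E007E mirror ASCII 0x20..0x7E
# at a fixed offset, as laid out in the Tags block chart.
TAG_OFFSET = 0xE0000


def to_tags(text: str) -> str:
    """Map printable ASCII characters to their tag counterparts (others are dropped)."""
    return "".join(
        chr(TAG_OFFSET + ord(c)) for c in text if 0x20 <= ord(c) <= 0x7E
    )


def from_tags(text: str) -> str:
    """Map tag characters back to ASCII (non-tag characters are dropped)."""
    return "".join(
        chr(ord(c) - TAG_OFFSET) for c in text if 0xE0020 <= ord(c) <= 0xE007E
    )


if __name__ == "__main__":
    hidden = to_tags("Hi")
    # Print each code point with its name from the Unicode character database.
    for ch in hidden:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print(from_tags(hidden))
```

The names printed by `unicodedata.name` are the same English descriptions the character tables carry, which is the kind of text a model could plausibly have seen.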
