Hacker News
by rahimnathwani 884 days ago
How and why would the tokenizer learn that a particular Unicode tag character was equivalent to a particular letter? I can't imagine there's a lot of text on the internet encoded in this way.

maybe it saw them used in their intended way (in emoji flag sequences such as the flag of England, where tag characters spell out a region code like "gbeng") and was able to make the association between the flags and their region codes, and then that led to it being able to interpret them as individual letters?
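
to make the flag case concrete, here's a rough sketch of how the flag of England is put together: a black flag emoji, then tag characters spelling "gbeng", then a cancel tag. The helper name `tag_payload` is made up for illustration, and this says nothing about what any model actually saw in training:

```python
# Flag of England as an emoji tag sequence:
# WAVING BLACK FLAG, tag characters spelling "gbeng", then CANCEL TAG.
england = (
    "\U0001F3F4"                                        # base emoji
    "\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067"  # tags: g b e n g
    "\U000E007F"                                        # cancel tag
)

TAG_OFFSET = 0xE0000  # tag characters mirror ASCII at this fixed offset


def tag_payload(seq: str) -> str:
    """Return the ASCII text spelled out by the tag characters in seq.

    Hypothetical helper: keeps only U+E0020..U+E007E and shifts each
    code point down into the ASCII range.
    """
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in seq
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


print(tag_payload(england))  # gbeng
```

so any text containing these flags pairs a visible flag with a hidden ASCII-shaped string, which is at least one place the association could come from.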

could also be from having been trained on Unicode character tables, which contain English descriptions of each code point

BTW these are the character tables:

https://unicode.org/charts/PDF/UE0000.pdf
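
if you don't want to open the PDF, the same names are available from Python's standard `unicodedata` module. A small sketch (the helper `to_tags` is my own name, not part of any library) showing that each tag character's official name spells out the letter it mirrors:

```python
import unicodedata

TAG_OFFSET = 0xE0000  # tag block sits at ASCII code point + 0xE0000


def to_tags(text: str) -> str:
    """Map printable ASCII text to the corresponding tag characters.

    Hypothetical helper for illustration; only meaningful for
    characters in the range U+0020..U+007E.
    """
    return "".join(chr(ord(c) + TAG_OFFSET) for c in text)


for ch in to_tags("hi"):
    # Prints the code point and its official Unicode name
    print(f"U+{ord(ch):05X}", unicodedata.name(ch))
# U+E0068 TAG LATIN SMALL LETTER H
# U+E0069 TAG LATIN SMALL LETTER I
```

so a character table in the training data would literally put "TAG LATIN SMALL LETTER H" next to the code point.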