How and why would the model learn that a particular Unicode tag character was equivalent to a particular letter? I can't imagine there's a lot of text on the internet encoded in this way.
maybe it saw them used in their intended way (tag sequences in subdivision flag emoji like England or Scotland, etc) and was able to make the association between the flags and their region codes, and then that led to it being able to interpret them as individual letters?
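for what it's worth, here's a rough sketch of what a flag tag sequence looks like at the code point level. the England flag is a black flag character followed by tag characters spelling "gbeng" and a cancel tag, and each tag character is just the ASCII code point plus 0xE0000. `ENGLAND_FLAG` and `decode_tags` are names I made up for illustration:

```python
# Emoji tag sequence for the flag of England: a black flag base character,
# tag characters spelling "gbeng", then CANCEL TAG (U+E007F).
ENGLAND_FLAG = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbeng")
    + "\U000E007F"
)


def decode_tags(text: str) -> str:
    """Return the ASCII characters carried by tag characters (U+E0020..U+E007E)."""
    return "".join(
        chr(ord(ch) - 0xE0000)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


print(decode_tags(ENGLAND_FLAG))  # gbeng
```

so any text that pairs such a flag with its region code gives a letter-by-letter alignment, at least in principle.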
could also be from having been trained on unicode character tables, which contain english descriptions of each code point
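to illustrate the character table idea: the official name of each tag character spells out the letter it mirrors, so a table listing code points next to names already states the mapping in plain English. a minimal sketch using Python's standard `unicodedata` module (`tag_name` is a hypothetical helper, not anything from the thread):

```python
import unicodedata


def tag_name(letter: str) -> str:
    """Look up the Unicode name of the tag character mirroring an ASCII letter."""
    # Tag characters sit at the ASCII code point plus 0xE0000.
    return unicodedata.name(chr(0xE0000 + ord(letter)))


print(tag_name("a"))  # TAG LATIN SMALL LETTER A
```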