Hacker News new | ask | show | jobs
by maytc 607 days ago
The difference in the dates example seems right to me 20 October 2024 and 2024-20-10 are not the same.

Months in different locales can be written as yyyy-MM-dd. It can also be a catalog/reference number. So, it seems right that their embedding similarity is not perfectly aligned.

So, it's not a tokenizer problem. The text meant different things according to the LLM.