| HN Mirror



	by chaxor 1145 days ago
	It's likely not a character-based model (although it's closed source, so anything is technically possible behind the scenes), so this makes some sense. The system can infer some relationships, which may be why 'agy' is conflated with 'agi', interestingly. But the tokenization process yields sequences of 'symbols' or indexes that are decoded to English, so the system has a more difficult task when asked about 'e's (probably something like token 4893): it has to determine which tokens (e.g. [358, 284840, 58292, 4830104, 57282, 4829193, 58282, 384, 24945]) contain 'e's or token 4893. None of them do directly, it seems, but 58292 may be 'ee', so you would get this wrong as well.
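	To make that concrete, here's a toy sketch in Python. Everything in it is an assumption for illustration: the vocabulary is made up (it just reuses a few of the IDs above), and the greedy longest-match tokenizer is a stand-in, not how the real (closed-source) model tokenizes. It only shows why looking for the 'e' token ID in the sequence misses an 'e' hidden inside a multi-character token like 'ee'.

```python
# Hypothetical toy vocabulary: strings and IDs are invented for illustration,
# not taken from any real tokenizer.
TOY_VOCAB = {"agr": 358, "ee": 58292, "e": 4893, "d": 384, "r": 57282, "b": 24945}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization of text into integer IDs."""
    ids = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def count_by_token_id(ids, letter_id):
    """Count occurrences of the single-letter token ID in the sequence."""
    return ids.count(letter_id)


def count_by_decoding(ids, letter, vocab=TOY_VOCAB):
    """Count the letter by mapping each ID back to its string first."""
    id_to_piece = {tok_id: piece for piece, tok_id in vocab.items()}
    return sum(id_to_piece[tok_id].count(letter) for tok_id in ids)


ids = tokenize("agreed")
print(ids)                                  # the model "sees" only these IDs
print(count_by_token_id(ids, 4893))         # looks only for the 'e' token
print(count_by_decoding(ids, "e"))          # needs to know what each ID spells
```

	In this toy setup the 'e' token never appears in the sequence for "agreed", so counting by ID comes up empty, while decoding each ID back to its string finds both letters. The model only gets the IDs, so it has to have learned what each one spells.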