	by wolttam 112 days ago
	When you're tokenizing, it doesn't really matter what surface form you use (how a token maps to and from a text string); the main thing is the overall number of tokens. XML is particularly amenable to tokenization because it is trivial to represent an entire tag as a single token (or as a pair of tokens, one for the opening tag and one for the closing tag). It gets a bit muddier with attributes, but you can still capture the core semantics of the tag with a single token. The model will learn the tag's attributes through training on usages of the tag.
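
	To make the counting argument concrete, here is a rough sketch of a toy tokenizer in which each known tag is one token for the opening form and one for the closing form. Everything in it (`build_vocab`, `tokenize`, `PIECE_RE`, the character-level fallback) is made up for illustration and is not how any particular production tokenizer works; attributes are deliberately left out, since that is the muddier case.

```python
import re

# Hypothetical splitter: an attribute-free tag, a run of word characters,
# or a single punctuation character. Whitespace is dropped.
PIECE_RE = re.compile(r"</?\w+>|\w+|[^\w\s]")


def build_vocab(tags):
    """Give every tag two entries: one opening token, one closing token."""
    vocab = {}
    for tag in tags:
        vocab[f"<{tag}>"] = len(vocab)
        vocab[f"</{tag}>"] = len(vocab)
    return vocab


def tokenize(text, vocab):
    """Split text into pieces, keeping known tags whole.

    Tags missing from the vocabulary fall back to one piece per character,
    a crude stand-in for whatever subword fallback a real tokenizer uses.
    """
    pieces = []
    for piece in PIECE_RE.findall(text):
        if piece.startswith("<") and piece not in vocab:
            pieces.extend(piece)  # unknown tag: one piece per character
        else:
            pieces.append(piece)
    return pieces


sample = "<title>Hi</title>"
with_tags = tokenize(sample, build_vocab(["title"]))
without_tags = tokenize(sample, {})
print(len(with_tags), len(without_tags))
```

	With the tag in the vocabulary the sample comes out as three pieces (open tag, text, close tag); without it, the same string costs one piece per tag character. The string form of the tag is incidental, and the token count is what changes.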