|
|
|
|
|
by wolttam
112 days ago
|
|
When you're tokenizing it does not matter really what you use (how you translate that token to-from a text string), the main thing is the overall number of tokens. XML is particularly amenable to tokenization because it is trivial to represent entire tags as a single token (or a pair of tokens, one for the open tag, one for the close). It gets a bit muddier with attributes, but you can still capture the core semantics of the tag with a single token. The model will learn that tag's attributes through training on usages of the tag. |
|