by anonymoushn
62 days ago
You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly this seems to be specific to the letter "d". Similar stuff happens around "<". About all caps words, some words are in the vocab in all caps, such as MERCHANTABILITY.
What in particular about this method breaks correct token boundaries?
On my first read, I took your comment to mean that some special tokens require multiple tokens to emit, so certain tokens can't be emitted alone. On a second read, I don't think that's what you're getting at?
Interesting that you've found similarities between "d" and the hidden tokens for opening an XML tag, pressing caps lock, and the other hidden tokens of note. I haven't run into any trouble extracting "d" tokens. Is there a particular model where you see that pattern?