Hacker News new | ask | show | jobs
by anonymoushn 62 days ago
You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly this seems to be specific to the letter "d". Similar stuff happens around "<". About all caps words, some words are in the vocab in all caps, such as MERCHANTABILITY.
1 comments

Could you please elaborate a bit more for my understanding?

What in particular about this method breaks correct token boundaries?

On my first read I read your comment as there are special tokens that require multiple tokens to emit, hence you can't get certain tokens emitted alone - but I don't think that's what you're getting at on a second read?

Interesting that you've found similarities between "d" and the hidden tokens for opening an xml tag, pressing caps lock and the other hidden tokens of note. I haven't run into any trouble extracting "d" tokens, is it a particular model that you see create that pattern?

That's " 'd ".strip(), an english contraction suffix. it's 1 token, but using this echo approach you will be served the apostrophe and the subsequent letter for the first time in different steps.
I couldn't reproduce this behavior with Sonnet 4, and Sonnet 3.7 has been deprecated since I messed with this stuff. You can try tokenizing the string "<hello> </hello>"

I think the correct tokenization of the string will not have any tokens that contain mixed punctuation and letters, but the result of this approach does contain such claimed tokens.