Hacker News new | ask | show | jobs
by DalasNoin 1475 days ago
On the other hand the openai tokenizer gives me a different tokenization ap - opl - oe [0]. If you capitalize A the result is A - pop - loe. The dalle 2 paper only specifies that it uses a BPE encoding, I would assume they used the same one as for gpt3 [0] https://beta.openai.com/tokenizer
1 comments

If they use BPE dropout, then the split can be different and not unique.

And for the record, they use BPE dropout for DALLE-1, see https://arxiv.org/pdf/2102.12092.pdf

I believe they only apply it during training.
right, that is my point. It is hard to know which combination triggers the current tokenization to be interpreted as bird.