by teddykoker 1472 days ago
According to [1], the byte pair encoding for “Apoploe vesrreaitais” (the words that produce bird images) is "apo, plo, e</w>, ve, sr, re, ait, ais</w>", and Apo-didae and Plo-ceidae are families of birds.

[1] https://twitter.com/barneyflames/status/1531736708903051265?...

1 comments

On the other hand, the OpenAI tokenizer [0] gives me a different tokenization: ap - opl - oe. If you capitalize the A, the result is A - pop - loe. The DALL-E 2 paper only specifies that it uses a BPE encoding; I would assume they used the same one as for GPT-3.

[0] https://beta.openai.com/tokenizer
If they use BPE dropout, then the split can be different and not unique.
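
To make the non-uniqueness concrete, here is a minimal sketch of BPE with dropout. The merge table `TOY_MERGES` and the function name `bpe_segment` are made up for illustration; this is not the actual CLIP, GPT-3, or DALL-E vocabulary, and it is not necessarily how OpenAI implemented it. At each step every applicable merge is skipped with probability `dropout`, so the same string can end up split in several ways.

```python
import random

# Hypothetical merge table, highest priority first (NOT a real vocabulary).
TOY_MERGES = [("a", "p"), ("p", "l"), ("o", "e"), ("ap", "o"), ("pl", "o")]

def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` using ranked merges; each candidate merge is
    skipped with probability `dropout` at every step."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that have a merge rule and survive the dropout draw.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        _, i = min(candidates)  # best-ranked surviving merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

if __name__ == "__main__":
    # Without dropout the split is deterministic.
    print(bpe_segment("apoploe", TOY_MERGES))
    # With dropout, different random draws can give different splits.
    splits = {
        tuple(bpe_segment("apoploe", TOY_MERGES, dropout=0.5, rng=random.Random(s)))
        for s in range(200)
    }
    for split in sorted(splits):
        print(" - ".join(split))
```

With `dropout=0.0` this behaves like plain greedy BPE and always returns the same split; with a nonzero value the set of splits seen for one word grows.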

And for the record, they use BPE dropout for DALL-E 1; see https://arxiv.org/pdf/2102.12092.pdf

I believe they only apply it during training.
Right, that is my point. It is hard to know which combination triggers the current tokenization to be interpreted as a bird.