by minimaxir
694 days ago
Good catch: the calculators here are bizarre. For GPT-4o, a 512x512 image uses 170 tile tokens. For GPT-4o mini, a 512x512 image uses 5,667 tile tokens. How does that even work in the context of a ViT? The patch size and the image encoder's output should be the same. Since the base token counts increase proportionally (which makes even less sense), I have a hunch there's a JavaScript bug instead.
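For anyone who wants to sanity-check the numbers, here is a rough sketch of the arithmetic. The two tile-token figures are the ones quoted above; `vit_patch_count` and the 16-pixel patch size are my own illustrative assumptions, not anything taken from the calculator or from OpenAI's models. It only shows that a plain ViT patch grid depends on the image and patch size, not on which model is billing for it.

```python
# Tile-token figures quoted in the comment for a 512x512 image.
GPT_4O_TILE_TOKENS = 170
GPT_4O_MINI_TILE_TOKENS = 5_667


def scale_factor(small: int, large: int) -> float:
    """Return how many times larger `large` is than `small`."""
    return large / small


def vit_patch_count(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping patches in a plain ViT grid.

    Hypothetical helper: patch_size is an assumption for illustration,
    not a documented property of either model.
    """
    return (height // patch_size) * (width // patch_size)


ratio = scale_factor(GPT_4O_TILE_TOKENS, GPT_4O_MINI_TILE_TOKENS)
print(f"mini / 4o tile-token ratio: {ratio:.2f}")

# Same image and same (assumed) patch size give the same patch count,
# so the encoder alone would not explain a difference this large.
print(vit_patch_count(512, 512, patch_size=16))
```

If the base token counts differ by the same ratio, that looks more like one constant multiplier applied somewhere than like a different patch grid.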