The price for using images as part of your prompt has indeed not changed between GPT-4o-mini and GPT-4o.
Yet overall, captioning 500 images now costs me 5x less. This is because when I'm captioning an image, I'm providing both an image and a text prompt. The cost of the image part of the prompt stays the same, but the cost of the text part (and of the generated caption) dropped dramatically.
Good catch: the calculators here are bizarre. For GPT-4o, a 512x512 image uses 170 tile tokens. For GPT-4o mini, a 512x512 image uses 5,667 tile tokens. How does that even work in the context of a ViT? The patch size and the image encoder's output should be the same for both models.
Since the base token counts increase proportionally (which makes even less sense), I have a hunch there's a JavaScript bug instead.
Confirmed that mini uses ~30x more tokens than base GPT-4o with the same image and the same prompt: { completionTokens: 46, promptTokens: 14207, totalTokens: 14253 } vs. { completionTokens: 82, promptTokens: 465, totalTokens: 547 }.
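A rough sanity check on why the inflated tile count might be deliberate rather than a bug. This is only a sketch: the per-token prices below are the launch list prices as I remember them ($5.00 per 1M input tokens for GPT-4o, $0.15 per 1M for GPT-4o mini) and are an assumption, not something stated in this thread. The tile token counts are the ones quoted above.

```python
# Assumed list prices in USD per 1M input tokens (not stated in the thread).
PRICE_PER_M_INPUT = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}

# Tile token counts for a 512x512 image, as quoted in the thread.
TILE_TOKENS = {"gpt-4o": 170, "gpt-4o-mini": 5667}


def image_cost_usd(model: str) -> float:
    """Dollar cost of one 512x512 tile at the assumed input price."""
    return TILE_TOKENS[model] * PRICE_PER_M_INPUT[model] / 1_000_000


# How much cheaper a mini input token is, and how many more tokens mini bills.
price_ratio = PRICE_PER_M_INPUT["gpt-4o"] / PRICE_PER_M_INPUT["gpt-4o-mini"]
token_ratio = TILE_TOKENS["gpt-4o-mini"] / TILE_TOKENS["gpt-4o"]

for model in TILE_TOKENS:
    print(f"{model}: ${image_cost_usd(model):.6f} per tile")
print(f"price ratio: {price_ratio:.1f}x, token ratio: {token_ratio:.1f}x")
```

If those prices are right, the two ratios come out nearly identical, so the per-image dollar cost ends up the same on both models. The measured ~30x on the full prompt would be a bit lower than the tile ratio because the text part of the prompt is not scaled up.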
Has anyone already validated this based on billed cost? Running a batch myself to check.
EDIT:
OK, so I captioned 500 images in "low resolution" mode with GPT-4o-mini.
Each one took approximately: "completion_tokens=84, prompt_tokens=2989, total_tokens=3073"
The reported GPT-4o-mini cost for the whole batch is $0.25.
Using GPT-4o this would cost me $1.33 (also in "low resolution" mode), with this breakdown:
"completion_tokens=98, prompt_tokens=239, total_tokens=337"
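For anyone who wants to reproduce the arithmetic, here is a minimal sketch. The per-request token counts and the batch size of 500 come from this comment; the prices (USD per 1M tokens) are assumed list prices and are not stated anywhere in the thread, and `batch_cost_usd` is just a helper name made up for illustration.

```python
# Assumed list prices in USD per 1M tokens (not stated in the thread).
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def batch_cost_usd(model: str, prompt_tokens: int,
                   completion_tokens: int, n_requests: int) -> float:
    """Cost of n_requests identical requests at the assumed prices."""
    p = PRICES[model]
    per_request = (prompt_tokens * p["input"]
                   + completion_tokens * p["output"]) / 1_000_000
    return per_request * n_requests


# Approximate per-image token counts quoted above, for 500 images.
mini = batch_cost_usd("gpt-4o-mini", prompt_tokens=2989,
                      completion_tokens=84, n_requests=500)
base = batch_cost_usd("gpt-4o", prompt_tokens=239,
                      completion_tokens=98, n_requests=500)

print(f"gpt-4o-mini: ${mini:.2f}")
print(f"gpt-4o:      ${base:.2f}")
print(f"ratio:       {base / mini:.1f}x")
```

Under those assumed prices the totals land on roughly $0.25 and $1.33, which matches the billed and estimated figures above, for a ratio of a little over 5x.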