| Why do GPT-4 Turbo and the updated GPT-3.5 Turbo have an output limit of only 4,096 tokens? Previous model: gpt-3.5-turbo-16k, 16,385 tokens shared between context and completion. New model: gpt-3.5-turbo-1106, 16,385 tokens context, 4,096 tokens completion. Previous model: gpt-4, 8,192 tokens shared between context and completion. New model: gpt-4-1106-preview, 128,000 tokens context, 4,096 tokens completion. Why would a GPT-3.5 model with the same 16K size no longer allow larger completions? And why would the new GPT-4 reduce the completion limit as well? gpt-4 can do 8,192 completion tokens and gpt-4-32k can do 32,768; now the limit is 4,096. So you would need to change the way you prompt (split the responses) to get a longer response. --- So are these new models taking the old base models with 4K tokens of context and completion, and extending the context to 128,000 while leaving the completion the same? If they could offer gpt-4 as gpt-4-8k and gpt-4-32k, why couldn't it have been 128,000 context and 32,768 completion? |
In my local tests on a 13B model, output tokens are 20-30x more expensive to produce than input tokens are to process. So OpenAI's pricing structure is based on the expectation that an average request has far more input tokens than output tokens. It didn't matter much if a small percentage of requests used all 4k tokens for output, but with a 128k context, letting the output fill a large share of the window is a different story.
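To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the 20-30x figure from my 13B test carries over (I picked 25x as a midpoint, and nothing here is an OpenAI number), and `relative_cost` is just a helper name made up for illustration. It compares the worst-case request in a 128k window under a 4,096-token output cap against a hypothetical 32,768-token cap.

```python
# Assumption: one output token costs about as much as `output_ratio` input tokens.
# 25 is a midpoint of the 20-30x range measured locally on a 13B model; it is
# not a figure published by OpenAI.
def relative_cost(input_tokens: int, output_tokens: int, output_ratio: float = 25.0) -> float:
    """Return the request cost expressed in input-token equivalents."""
    return input_tokens + output_ratio * output_tokens


CONTEXT = 128_000  # context window of gpt-4-1106-preview, as quoted above

# Worst case for each cap: the output uses the full cap, the input fills the rest.
for max_out in (4_096, 32_768):
    worst = relative_cost(CONTEXT - max_out, max_out)
    print(f"output cap {max_out:>6}: worst case = {worst:,.0f} input-token equivalents")
```

Under these assumptions the larger cap makes the worst-case request several times more expensive to serve, even though the total token count is identical in both cases.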