by jessenaser 958 days ago

Why do GPT-4 Turbo and the updated GPT-3.5 Turbo have an output limit of only 4,096 tokens?

Previous Model: gpt-3.5-turbo-16k, 16385 tokens context and completion (shared)

New Model: gpt-3.5-turbo-1106, 16385 tokens context, 4096 tokens completion

Previous Model: gpt-4, 8192 tokens context and completion (shared)

New Model: gpt-4-1106-preview, 128000 tokens context, 4096 tokens completion
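
To make the difference between a shared budget and a separate completion cap concrete, here is a rough sketch using only the numbers listed above. The `MODELS` table layout and the `max_completion_tokens` helper are made up for illustration and are not part of any OpenAI library.

```python
# Limits as quoted in this post. The dict layout and the helper name are
# illustrative assumptions, not an official API.
MODELS = {
    "gpt-3.5-turbo-16k": {"context": 16385, "completion_cap": None},
    "gpt-3.5-turbo-1106": {"context": 16385, "completion_cap": 4096},
    "gpt-4": {"context": 8192, "completion_cap": None},
    "gpt-4-1106-preview": {"context": 128000, "completion_cap": 4096},
}


def max_completion_tokens(model: str, prompt_tokens: int) -> int:
    """Return how many tokens are left for the completion.

    With a shared budget (completion_cap is None) the completion can use
    whatever the prompt leaves free. With a separate cap, it is the smaller
    of the free space and the cap.
    """
    limits = MODELS[model]
    remaining = max(limits["context"] - prompt_tokens, 0)
    cap = limits["completion_cap"]
    return remaining if cap is None else min(remaining, cap)


if __name__ == "__main__":
    for name in MODELS:
        print(name, max_completion_tokens(name, prompt_tokens=1000))
```

With a 1,000-token prompt, the old 16K model leaves most of its window for output, while the new one stops at the cap regardless of how much context is free.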

Why would a GPT-3.5 model with the same 16K context size no longer allow larger completions?

Why would the new GPT-4 reduce the completion tokens as well? gpt-4 can do 8192 and gpt-4-32k can do 32768 completion tokens. Now the limit is 4096.

So you would need to change the way you prompt (splitting the response across several requests) to get a longer response.
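
One way to do that splitting is a simple continuation loop, sketched below. This is only a sketch under assumptions: `generate` is a hypothetical callable standing in for whatever model call you use, and it is assumed to return the text chunk plus a flag saying whether the output was cut off by the token limit. How you detect truncation depends on your client.

```python
from typing import Callable, Tuple


def generate_long(
    prompt: str,
    generate: Callable[[str], Tuple[str, bool]],
    max_rounds: int = 8,
) -> str:
    """Stitch together a long response from several capped completions.

    `generate(text)` is a hypothetical wrapper around a model call. It is
    assumed to return (chunk, truncated), where `truncated` is True when
    the chunk stopped because the completion limit was hit.
    """
    parts = []
    context = prompt
    for _ in range(max_rounds):
        chunk, truncated = generate(context)
        parts.append(chunk)
        if not truncated:
            break
        # Feed everything written so far back in so the next call continues it.
        context = prompt + "".join(parts)
    return "".join(parts)
```

Note that each round resends the text produced so far as input, so the prompt grows with every call and you pay for those tokens again.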

---

So are these new models taking the old base models of 4K tokens context and completion and changing the context to 128000 but leaving the completion the same? If they could offer gpt-4 in 8K and 32K variants, why couldn't it have been 128000 context and 32768 completion?

3 comments

srdjanr 958 days ago

Probably because it's too expensive. The prompt can be processed quickly, but output tokens are much slower to generate (and that means more expensive).

From my local test on a 13B model, output tokens are 20-30x more expensive than input tokens. So OpenAI's pricing structure is based on the expectation that there are many more input tokens than output tokens in an average request. It didn't matter too much if a small percentage of requests used all 4k tokens for output, but with 128k it's a different story.
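
As a back-of-the-envelope illustration of that argument, the sketch below expresses a request's cost in "input-token equivalents". The `relative_cost` helper is made up for this example, and the default ratio of 25 is just the midpoint of the 20-30x range from the local 13B test. It is an assumption, not a measurement of OpenAI's models.

```python
def relative_cost(
    prompt_tokens: int,
    output_tokens: int,
    output_ratio: float = 25.0,
) -> float:
    """Cost of a request in input-token equivalents.

    `output_ratio` is how many input tokens one output token costs. The
    default of 25 is an assumed midpoint of the 20-30x range quoted above.
    """
    return prompt_tokens + output_ratio * output_tokens


if __name__ == "__main__":
    # Worst case under a 4k completion cap versus a hypothetical 128k completion.
    capped = relative_cost(prompt_tokens=0, output_tokens=4096)
    uncapped = relative_cost(prompt_tokens=0, output_tokens=128000)
    print(capped, uncapped, uncapped / capped)
```

Under this assumption, a request that filled a 128k window with output would cost about 31 times as much as one that hit the 4,096-token cap.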

Racing0461 958 days ago

I believe OpenAI wants to lower the time it takes for requests to finish so it can accept more requests per server/GPU, i.e. money.

qup 958 days ago

If I'm not mistaken, the model has to be trained for a specific context window.

refulgentis 958 days ago

More or less. There's stuff you can do to extend the window of an existing model fairly easily, i.e. with a LoRA-type training budget, O($1000).

But in practice, even when the max output token count was equal to the context size, the model simply couldn't make use of it, no matter how many prompt engineering tricks I threw at it.[1] And I've heard anecdotally that the same is true of that LoRA-type technique.

[1] TL;DR: it got to about 1/5th of the requested length. The prompt: write 100 pages, 3 paragraphs each, number the pages as you go and write 1 page at a time until 100. Also write out "I have written page N and need to write 100 pages total" after each page.

Inevitably it would "get tired" and be like "end page 23...now page 100"