Hacker News new | ask | show | jobs
by hanoz 1135 days ago
The more token capacity that's added the more wasteful it seems to have to use this statelessly. Is there any avoiding this?

Wonderous as this new tech is, it seems a bit much to be paying $2 a question in a conversation about a 32k token text.

6 comments

As a human if you have to present me 32k tokens and I have to give you an answer, you would probably have to pay me more than $2
If I wanted to have a conversation about it, and you wanted to charge me a flat fee per utterance on the basis that you had to reread the text anew every time, I wouldn't be paying you at all.
If we were having such conversation via e-mail/IM and I learned that you're just asking me questions one by one in your replies, questions which you could've easily included in your first e-mail - then believe me when I say it, I would charge you the same way OpenAI does, and I'd throw in an extra 50% fee for being inconsiderate and not knowing how to communicate effectively.
> questions which you could've easily included in your first e-mail

That's not really how conversation/chat works is it?

Have you seen how lawyers bill for their time?
Yeah, I can see this being useful for one-off queries, but don't they want to offer some sort of final training ("last-mile" I called it in another comment. I can't remember what the proper term is.) to companies to customize the model so it already has all the context they need baked in to every query?
They used to offer exactly this for fine tuning models. Never offered it after ChatGPT, I think the difficulty comes with fine tuning RLHF models, not obvious how to correctly do this.
As far as I know it's not.
It's unfortunate. There are some online tutorials that instruct you to embed all your code and perform top-k cosine similarity searches, populating the responses accordingly.

It's quite interesting if you can tweak your search just right. You can even use less tokens than 8K even!

The usage needs to be for high value queries.

Using it on a simple conversation is not its intended purpose, that's like using a supercomputer to play pong.

Handle the state on the application side...

It is like complaining that HTTP is limiting because it is stateless. Build state on top of it.

I think he's talking about computational efficiency. If you're loading in 29k tokens and you're expecting to use those again, you wouldn't need to do the whole matrix multiplication song and dance again if you just kept the old buffers around for the next prompt.
I don't think this can necessarily be optimized at least with how the models work right now
You can ask multiple/multipart questions.