Hacker News
by yatz 764 days ago
The Assistants API is promising, but earlier versions have many issues, especially with how costs are calculated. Per the OpenAI docs, you pay for data storage, a fixed price per API call, plus token usage. It sounds straightforward until you start using it.

Here is how it works. When you upload attachments, in my case a very large PDF, it chunks that PDF into small parts and stores them in a vector database. The chunking does not seem to be that great: every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt, which inflates your per-request cost to 10 times more than the prompt + response tokens combined. So be mindful of the hidden costs and monitor your usage.
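To make the arithmetic concrete, here is a rough sketch of how retrieved chunks can dominate the token count of a request. Every number in it (prompt size, response size, chunk count, chunk size) is a made-up assumption for illustration, not OpenAI's actual chunking settings or pricing, and `request_tokens` is a hypothetical helper.

```python
def request_tokens(prompt_tokens, response_tokens, chunks_retrieved, tokens_per_chunk):
    """Compare the tokens you typed and received with the tokens added by retrieval.

    Returns (visible, retrieved, inflation), where inflation is how many times
    larger the retrieved context is than prompt + response combined.
    """
    visible = prompt_tokens + response_tokens
    retrieved = chunks_retrieved * tokens_per_chunk
    inflation = retrieved / visible
    return visible, retrieved, inflation


# Hypothetical request: a 200-token question, a 600-token answer,
# and 10 retrieved chunks of 800 tokens each.
visible, retrieved, inflation = request_tokens(200, 600, 10, 800)
print(f"prompt + response: {visible} tokens")
print(f"retrieved context: {retrieved} tokens ({inflation:.1f}x)")
```

With these assumed numbers the retrieved context is 10 times the prompt and response combined, which is the kind of ratio described above. The real ratio depends on how many chunks the service pulls and how large they are.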

4 comments

> as every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt,

This is how RAG works.

While you can come up with workarounds, like using lesser LLMs as a pre-filtering step, the fact is that if you need GPT to read the doc, you need GPT to read the doc.

True, this is how RAG works, but this is why I prefer to use open-source LLMs for RAG: the token costs are less opaque, and I can control how many chunks I pull from the database to manage my costs.
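As a sketch of what "controlling how many chunks I pull" could look like in a self-managed RAG setup: the function below ranks chunks by cosine similarity and stops at either a chunk limit or a token budget. The chunk data, the two-dimensional vectors, and the names `retrieve`, `top_k`, and `max_tokens` are all hypothetical; a real setup would use an embedding model and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, top_k=3, max_tokens=1000):
    """Pick the most similar chunks, capped by count and by token budget.

    chunks is a list of (text, vector, token_count) tuples.
    Returns (selected_texts, tokens_used).
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    picked, used = [], 0
    for text, _vec, n_tokens in ranked[:top_k]:
        if used + n_tokens > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        picked.append(text)
        used += n_tokens
    return picked, used


# Toy data: (text, embedding, token count). Values are invented.
chunks = [
    ("intro", [1.0, 0.0], 400),
    ("install", [0.9, 0.1], 500),
    ("warranty", [0.0, 1.0], 300),
    ("setup", [0.7, 0.7], 450),
]
texts, used = retrieve([1.0, 0.0], chunks, top_k=3, max_tokens=1000)
print(texts, used)
```

Lowering `top_k` or `max_tokens` trades answer coverage for a predictable upper bound on input tokens per request.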
I believe it will get better and more efficient as we go. On a side note, OpenAI seems to release products before they are ready and they evolve as they go.
> I believe it will get better and more efficient as we go.

Yes of course. The point remains: the LLM has to process the data somehow.

If you are concerned about costs and token usage, then switch to a provider that works for your problem (Gemini Flash looks very interesting).

Yup, this seems right. You pay for tokens no matter what, even in other APIs. Did you know you can set an expiration for files, vector stores, etc.? No need to pay for long-term storage on those. Also, threads are free.
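A minimal sketch of an expiration policy for a vector store, built as a plain request body without making any API call. It assumes the `expires_after` field with `anchor` and `days` as described in the Assistants API vector store documentation; the helper name `vector_store_request`, the store name, and the 7-day value are hypothetical, so check the current docs before relying on it.

```python
def vector_store_request(name, days):
    """Build a request body for a vector store that expires after `days` of inactivity.

    Assumption: the API accepts `expires_after` with an `anchor` of
    "last_active_at" and a positive number of `days`.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "name": name,
        "expires_after": {"anchor": "last_active_at", "days": days},
    }


# Hypothetical store that is cleaned up a week after it was last used.
body = vector_store_request("manual-pdf", 7)
print(body)
```

The body would then be passed to whichever client or HTTP call creates the vector store.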
There isn’t really any other way for this to work. The only way for the model to answer questions on your pdf is for the information to be somewhere in the prompt.
That might be true of specific models or specific APIs for accessing them, but I’d argue it isn’t even remotely true of neural networks generally or generatively-pretrained decoder-only attention-inspired language models in particular.

Ideally, if you want a model’s weights to include a credible representation of non-trivial data, you want it somewhere in the training pipeline (usually earlier is better for important stuff, but that’s a heuristic at best), but there’s transfer learning of various kinds, and joint losses of countless kinds (CLIP in SD-style diffusers comes to mind), and fine-tunes (if that doesn’t just count as transfer learning), and dimensionality reduction that is often remarkably effective, and multi-tower models like what evolved into DLRM, and I’m forgetting/omitting easily 100x the approaches I mentioned.

It’s possible I misunderstand you, so please elaborate if so?

The way they vectorized the PDF could be less efficient than simply extracting the text and dropping it into context as text. If it's a 100 MB PDF then it's probably a scanned PDF, and OpenAI is probably using an OCR model to vectorize each page directly. It seems an opaque process with room to be inefficient. So I would be interested to know if we could save on token/vector fees by preprocessing the PDF to text with our own OCR.
No, it is not a scanned PDF but a standard textual PDF with tables, bullet points, chapters, etc. Somewhat like a manual.
How large is a very large PDF?
Close to 100 MB.
FWIW, that's about an order of magnitude larger than I imagined a "very large PDF" to be. That's an enormous PDF.
Are the pages complete images (scanned document) or is it 100 MB of text with some images (graphs etc.) mixed in?
Plain text, tables, and bulleted lists. All text, no graphs or images.