by ramesh1994 1126 days ago
I think parts of the write-up are great.

Some parts of the gist rest on fairly specific assumptions:

> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding

> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries

I don't know how useful these numbers are once you drop the assumption that a self-hosted model will work as well as the API.

> 10x: Throughput improvement from batching LLM requests

I see that the write-up mentions memory as a caveat here, but the gain also depends on the card's specs. The memory bandwidth and TFLOPs of, say, a 4090 are superior to a 3090's, even though both have the same amount of VRAM. Combined with the token-length caveat mentioned in the gist itself, that makes the 10x claim not a useful rule of thumb.
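To make that concrete, here is a rough back-of-the-envelope sketch, not anything from the gist. It assumes a toy roofline model in which one decode step costs the larger of "read all the weights once" and "do the math for every sequence in the batch", and in which the batch size is capped by the VRAM left over for the KV cache. The function names and every number in the usage note are made up for illustration and are not the specs of any real card or model.

```python
def batching_speedup(batch_size, weight_bytes, bandwidth_bytes_per_s,
                     flops_per_token, peak_flops):
    """Throughput at batch_size relative to batch size 1 (toy roofline model)."""
    def step_time(b):
        # Weights are streamed from memory once per step, regardless of batch.
        memory_time = weight_bytes / bandwidth_bytes_per_s
        # Compute grows linearly with the number of sequences in the batch.
        compute_time = b * flops_per_token / peak_flops
        return max(memory_time, compute_time)

    tokens_per_s_batched = batch_size / step_time(batch_size)
    tokens_per_s_single = 1 / step_time(1)
    return tokens_per_s_batched / tokens_per_s_single


def max_batch_by_memory(vram_bytes, weight_bytes, kv_bytes_per_token, seq_len):
    """Largest batch whose KV cache fits in the VRAM left after the weights."""
    free_bytes = vram_bytes - weight_bytes
    return max(0, free_bytes // (kv_bytes_per_token * seq_len))
```

With hypothetical inputs of 14e9 bytes of weights, 1e12 bytes/s of bandwidth, 14e9 FLOPs per token and 100e12 peak FLOPs, the speedup is about 10x at batch 10 but flattens out at about 100x from batch 100 onward. Under the same made-up setup with 24e9 bytes of VRAM and 524288 bytes of KV cache per token, the memory cap drops from a batch of 37 at 512 tokens to 9 at 2048 tokens. So whether you actually see 10x depends on both the card and the sequence length.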


> This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

That only holds in the narrow use case of a strict look-up. The comparison seems to exaggerate the cost difference between two approaches that have completely different trade-offs.