by ramesh1994
1126 days ago
I think parts of the write-up are great. There are some unique assumptions being made in parts of the gist.

> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding

> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries

I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.

> 10x: Throughput improvement from batching LLM requests

I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs. The memory bandwidth and TFLOPs offered by, say, a 4090 are superior to a 3090's, even though both have the same amount of VRAM. The token-length caveat mentioned in the gist itself makes the 10x claim not a useful rule of thumb.
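To make the dependence on card specs and token length concrete, here is a rough back-of-envelope sketch. It assumes a simple roofline model in which each decode step reads the weights once and costs a fixed number of FLOPs per token. The function names and every number in it are illustrative placeholders of mine, not measurements and not figures from the gist.

```python
def max_batch(vram_gb, weights_gb, kv_mb_per_token, seq_len):
    """Largest batch whose KV cache fits in the VRAM left after the weights."""
    free_mb = (vram_gb - weights_gb) * 1024
    return max(int(free_mb // (kv_mb_per_token * seq_len)), 0)


def batching_speedup(batch, weights_gb, bandwidth_gbps, tflops, gflop_per_token):
    """Throughput at `batch` relative to batch size 1 (simplified roofline)."""
    def step_time(b):
        memory_s = weights_gb / bandwidth_gbps             # weights read once per step
        compute_s = b * gflop_per_token / (tflops * 1000)  # GFLOP / (GFLOP/s)
        return max(memory_s, compute_s)                    # slower resource dominates

    return (batch / step_time(batch)) / (1 / step_time(1))


# Illustrative inputs (assumptions): 24 GB card, 14 GB of weights,
# 0.5 MB of KV cache per token, 1000 GB/s bandwidth, 100 TFLOPs.
short_ctx = max_batch(24, 14, 0.5, seq_len=512)
long_ctx = max_batch(24, 14, 0.5, seq_len=4096)
print(short_ctx, batching_speedup(short_ctx, 14, 1000, 100, 14))
print(long_ctx, batching_speedup(long_ctx, 14, 1000, 100, 14))
```

Under these assumptions the same card fits a much smaller batch at long context than at short context, so the achievable speedup swings widely with token length alone, and the point where compute becomes the bottleneck moves with the bandwidth-to-TFLOPs ratio of the card.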
In the narrow use case of a strict look-up, this seems to exaggerate the cost difference while having completely different trade-offs.