| HN Mirror



	by weichiang 1095 days ago
	The cost depends on GPU type, serving system, and traffic pattern. Check out the throughput comparisons in vLLM's blog post: https://vllm.ai/. If you serve a 7B model on cost-optimized GPUs (A10G/L4) and keep it busy, it can be a lot cheaper than GPT-3.5 Turbo. Though it's not a fair comparison, as 3.5's quality is still far better.

2 comments

zhwu 1095 days ago

Great reference!

Just want to add a point about hosting your own LLM vs. using ChatGPT. Cost is definitely a thing to consider, but it also depends on whether it is OK to share the requests made to your product with OpenAI.

Also, something you cannot do with ChatGPT is customize it with your own data, such as internal documents. As shown in the blog, the model we trained ourselves can easily learn its identity.


weichiang 1095 days ago

Say you use an A10G at ~$1.2/hr with full utilization on vLLM at 112 reqs/min: that works out to ~$0.00018 per request, versus $0.002 per 1K tokens for GPT-3.5 Turbo.
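
For anyone who wants to redo that arithmetic with their own numbers, here is a rough sketch. The function names are made up for illustration (nothing here comes from vLLM), and it assumes GPU rental is the only self-hosting cost and that the GPU stays fully busy. The break-even figure is just the request size at which the API price matches the per-request GPU cost; it says nothing about quality.

```python
def cost_per_request(gpu_usd_per_hour: float, requests_per_minute: float) -> float:
    """GPU rental cost per request, assuming full utilization."""
    return gpu_usd_per_hour / (requests_per_minute * 60)


def break_even_tokens(self_hosted_usd_per_request: float,
                      api_usd_per_1k_tokens: float) -> float:
    """Tokens per request at which the API costs the same as one self-hosted request."""
    return self_hosted_usd_per_request / api_usd_per_1k_tokens * 1000


# Numbers from the comment: A10G at ~$1.2/hr, 112 reqs/min, API at $0.002 per 1K tokens.
a10g = cost_per_request(1.2, 112)
print(f"${a10g:.5f} per request")                       # $0.00018 per request
print(f"{break_even_tokens(a10g, 0.002):.0f} tokens")   # 89 tokens
```

So under these assumptions, any request that uses more than roughly 89 tokens would cost more through the API than on the self-hosted A10G.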


npsomaratna 1094 days ago

Quick question: what would you estimate the running cost of Llama 2 70B to be? (On GPU, and assuming maximum utilization.)


cpill 1094 days ago

Yeah, that's the real question here.
