	by ktaube 1087 days ago
	What's the cheapest way to run a model such as LLaMa2-13B and serve it as an API? I've tried Inference Endpoints and Replicate, but both would cost more than just using OpenAI's offering.

4 comments

fy20 1087 days ago

You can probably run it locally with llama.cpp on the CPU alone, but it will be slow. I have a laptop that's a couple of years old with an RTX 3060, and the model runs pretty well split across the CPU and GPU.
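
For a rough idea of how such a CPU/GPU split might be sized, here is a minimal sketch. It assumes every transformer layer takes an equal share of the model file and that some VRAM is held back for the KV cache and scratch buffers; both are simplifications. The function name and the example numbers (file size, VRAM, reserve) are hypothetical and not measured values. The result is the kind of figure you would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option.

```python
def gpu_layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes=2**30):
    """Estimate how many layers can be offloaded to the GPU.

    Assumption: all layers are the same size, so each one costs
    model_bytes / n_layers. reserve_bytes is VRAM kept free for the
    KV cache and scratch buffers (a guess, tune it for your setup).
    """
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserve_bytes, 0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable // per_layer))


# Hypothetical example: a 13B model with 40 layers quantized to a
# ~7.4 GB file, on a GPU assumed to have 6 GiB of VRAM.
layers = gpu_layers_that_fit(7.4e9, 40, 6 * 2**30)
print(f"try offloading about {layers} layers to the GPU")
```

Whatever does not fit stays on the CPU, which is why a partial offload still helps compared with CPU-only inference.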


lgrammel 1087 days ago

llama.cpp has a server with a REST API that you can use: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
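
A minimal client sketch for that server, using only the Python standard library. It assumes the server is already running locally on port 8080 and exposes the `POST /completion` endpoint, which takes a JSON body with `prompt` and `n_predict` and returns JSON with a `content` field; check the server README for the version you build, since the API may differ. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=128, base_url="http://localhost:8080"):
    """Build a POST request for the llama.cpp server's /completion endpoint.

    base_url is an assumption: adjust it to wherever your server listens.
    """
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


# Usage, with a server running:
#   print(complete("Explain what a VPS is in one sentence.", n_predict=64))
```

Keeping the request construction separate from the network call makes it easy to inspect or log what is sent before pointing the client at a real server.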


garciasn 1087 days ago

I mean, the main reason to host your own model instead of using OpenAI is to keep OpenAI from accessing your data and using it for X, Y, and Z. Given the cost and the quality of the results, I wouldn't roll my own unless there were concerns about data safety.


l5870uoo9y 1087 days ago

I am interested in that as well. Can LLaMa2 models be deployed to a VPS? (Preferably the 70B model.)
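
Whether a given VPS can hold the model comes down mostly to memory. As a back-of-the-envelope sketch, the weights alone need roughly parameters × bits per weight / 8 bytes. This ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound; the function name and the bit widths shown are illustrative assumptions.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB.

    Lower bound only: KV cache and runtime overhead are not included.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Compare a 70B-parameter model at a few common precisions.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gib(70e9, bits):.0f} GiB for weights")
```

Comparing that figure with the RAM of the VPS plan gives a first yes/no answer before worrying about speed.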
