by ktaube 1041 days ago
What's the cheapest way to run e.g. LLaMa2-13B and have it served as an API?

I've tried Inference Endpoints and Replicate, but both would cost more than just using the OpenAI offering.

4 comments

You can probably run it locally with llama.cpp using only the CPU, but it will be slow. I have a laptop that's a couple of years old with an RTX 3060, and it runs pretty well split across the CPU and GPU.
llama.cpp has a server with a REST API that you can use: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
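For anyone who wants to try it, here is a minimal sketch of calling that server from Python using only the standard library. It assumes the server is already running locally on port 8080 with a model loaded, and that it accepts a POST to `/completion` with `prompt` and `n_predict` fields and returns the text under `content`. The helper names `build_completion_request` and `complete` are made up for illustration; check the server README for the options your build supports.

```python
import json
import urllib.request

# Assumption: the llama.cpp example server is running locally on port 8080
# and exposes POST /completion. Adjust BASE_URL for your setup.
BASE_URL = "http://127.0.0.1:8080"


def build_completion_request(prompt, n_predict=128, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=128, base_url=BASE_URL, timeout=120):
    """Send the prompt to the server and return the generated text."""
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the generated text is returned under the "content" key.
    return body["content"]


if __name__ == "__main__":
    print(complete("Building a website can be done in 10 simple steps:"))
```

Keeping the request construction separate from the network call makes it easy to check the payload without a running server, and the generous timeout is there because CPU-only generation can take a while.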
I mean, the main reason to host your own instead of using OpenAI is to keep OpenAI from accessing your data and using it for X, Y, and Z. Given the cost and the quality of the results, I wouldn't roll my own unless there were concerns about data safety.
I am interested in that as well. Can LLaMa2 models be deployed to a VPS? (Preferably the 70B model.)