I'm actually using Ollama for its REST API endpoint. Llama.cpp does now have its own server implementation. Unfortunately, the two have different endpoints and behave a little differently.
Some, like c0sogi/llama-api, are pretty neat because they support concurrency and multiple backends (llama.cpp and Exllama, although it could be expanded).
While you might lose out on some low-level configurability, being able to easily swap between OpenAI and local models is a big win in my book.
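To make the "easy swap" concrete, here is a rough sketch, assuming the local server exposes an OpenAI-style `/chat/completions` route. The helper name `build_chat_request`, the local URL, and the model name are made up for illustration; check what your server actually serves.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper: only base_url, model and api_key change between
    a hosted endpoint and a local OpenAI-compatible server.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Hosted APIs expect a bearer token; many local servers ignore it.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


messages = [{"role": "user", "content": "Say hi"}]
# Assumed local address and model name; substitute your own.
local_req = build_chat_request("http://localhost:8000/v1", "local-model", messages)
# Sending would be: urllib.request.urlopen(local_req)
```

The calling code stays the same either way; only the base URL, model name, and key differ.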
Ooba is my favorite. It automatically converts chats to single prompts using the model-specific (finicky) formatting specs (e.g. [INST] [/INST] etc. for llama2) so that you can directly submit chat dialogs to the endpoint. This is a subtle point not obvious to many and I wrote about it here:
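For anyone wondering what that conversion looks like, here is a minimal sketch of flattening a chat into one llama2-style prompt. It follows the commonly published Llama 2 chat layout as I understand it, is not Ooba's actual code, and `llama2_prompt` is a name I made up. In practice the `<s>`/`</s>` markers are usually added by the tokenizer; they are written as text here only to show the shape.

```python
def llama2_prompt(messages):
    """Flatten a chat (list of {"role", "content"} dicts) into one prompt string.

    Illustrative only: approximates the Llama 2 chat layout with
    [INST] ... [/INST] around user turns and <<SYS>> around the system prompt.
    """
    system = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is folded into the first user turn.
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    prompt = ""
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            content = (system if i == 0 else "") + msg["content"]
            prompt += f"<s>[INST] {content} [/INST]"
        elif msg["role"] == "assistant":
            # Each completed assistant turn closes the sequence.
            prompt += f" {msg['content']} </s>"
    return prompt


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
print(llama2_prompt(chat))
```

Get a space or a tag wrong and the model still answers, just noticeably worse, which is why having the server do this for you is nice.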
To me a killer feature would be easily running different models simultaneously, such as one for embeddings and another for completion (e.g. chat). This can likely be done already by specifying the model parameter in Ollama (and others), but I've not explored it much yet.
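A rough sketch of what that could look like against Ollama's REST API, using its `/api/generate` and `/api/embeddings` routes and picking a model per request. I have not verified how well Ollama keeps two models loaded at once. The model names are assumptions (whatever you have pulled locally), and `ollama_request` and `send` are hypothetical helpers.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def ollama_request(endpoint, model, prompt):
    """Build a request for /api/generate or /api/embeddings with a given model."""
    payload = {"model": model, "prompt": prompt}
    if endpoint == "generate":
        payload["stream"] = False  # ask for one JSON object instead of a stream
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send a prepared request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Assumed model names; use whichever models are installed locally.
embed_req = ollama_request("embeddings", "nomic-embed-text", "hello world")
chat_req = ollama_request("generate", "llama2", "Why is the sky blue?")
# With Ollama running: send(embed_req)["embedding"], send(chat_req)["response"]
```

Whether both models stay resident or get swapped in and out per call is up to the server, so latency may vary.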