by mgreg 996 days ago
I'm actually using Ollama for its REST API endpoint. Llama.cpp does now have its own server implementation. Unfortunately, the two have different endpoints and behave a little differently.

* https://github.com/jmorganca/ollama/blob/main/docs/api.md

* https://github.com/ggerganov/llama.cpp/blob/master/examples/...


I put together a list of OpenAI API compatibility layers for local LLMs recently: https://llm-tracker.info/books/llms/page/openai-api-compatib...

Some, like c0sogi/llama-api, are pretty neat because they support concurrency and multiple backends (llama.cpp and Exllama, although it could be expanded).

While you might lose out on some low-level configurability, being able to easily swap between OpenAI and local models is a big win in my book.
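
To make the swap concrete, here is a minimal sketch of the idea, assuming the local server exposes an OpenAI-style /chat/completions route. The local URL, port, and model names are placeholders I made up for illustration, not something any of these projects guarantees:

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="not-needed"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


messages = [{"role": "user", "content": "Hello"}]

# Hosted OpenAI endpoint (the key is a placeholder).
hosted = build_chat_request(
    "https://api.openai.com/v1", "gpt-3.5-turbo", messages, api_key="sk-..."
)

# Hypothetical local compatibility layer: only the base URL and model change.
local = build_chat_request("http://localhost:8000/v1", "local-model", messages)

# urllib.request.urlopen(local) would actually send the request.
```

The calling code stays the same either way, which is the whole appeal: only the base URL and model name differ.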

Ooba is my favorite. It automatically converts chats to single prompts using the model-specific (finicky) formatting specs (e.g., [INST] [/INST] for llama2), so you can submit chat dialogs directly to the endpoint. This is a subtle point not obvious to many, and I wrote about it here:

https://langroid.github.io/langroid/blog/2023/09/19/language...
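
For anyone wondering what that conversion looks like, here is a rough sketch that flattens an OpenAI-style message list into a single prompt using the commonly documented Llama 2 chat layout ([INST], <<SYS>>). The function name is hypothetical, and this is a simplified illustration, not Ooba's actual code:

```python
def llama2_prompt(messages):
    """Flatten OpenAI-style chat messages into one Llama-2-chat prompt string."""
    turns = list(messages)
    system = ""
    # An optional leading system message is folded into the first user turn.
    if turns and turns[0]["role"] == "system":
        system = "<<SYS>>\n" + turns[0]["content"] + "\n<</SYS>>\n\n"
        turns = turns[1:]

    prompt = ""
    for i, msg in enumerate(turns):
        if msg["role"] == "user":
            prefix = system if i == 0 else ""
            prompt += "<s>[INST] " + prefix + msg["content"].strip() + " [/INST]"
        elif msg["role"] == "assistant":
            prompt += " " + msg["content"].strip() + " </s>"
        else:
            raise ValueError("unexpected role: " + msg["role"])
    return prompt


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
print(llama2_prompt(chat))
```

Other model families use different markers, which is why having the server handle this per model saves a lot of fiddling.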

Very cool and thanks for sharing.

To me, a killer feature would be easily running different models simultaneously, such as one for embeddings and another for completion (e.g., chat). This can likely be done already by specifying the model parameter in Ollama (and others), but I've not explored it much yet.
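
A sketch of what per-request model selection might look like, assuming the /api/generate and /api/embeddings endpoints from Ollama's API docs and its default port 11434. The model names are placeholders, and this says nothing about whether both models stay loaded at the same time:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def ollama_request(endpoint, model, prompt):
    """Build (but do not send) a request that names the model explicitly."""
    payload = {"model": model, "prompt": prompt}
    if endpoint == "generate":
        payload["stream"] = False  # ask for one JSON response, not a stream
    return urllib.request.Request(
        OLLAMA_URL + "/api/" + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One model for embeddings, another for completion (both names are placeholders).
embed_req = ollama_request("embeddings", "embedding-model-placeholder", "Some text")
chat_req = ollama_request("generate", "llama2", "Summarize: Some text")

# urllib.request.urlopen(chat_req) would actually send the request.
```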