Hacker News new | ask | show | jobs
by sqs 883 days ago
I posted about my awesome experiences using Ollama a few months ago: https://news.ycombinator.com/item?id=37662915. Ollama is definitely the easiest way to run LLMs locally, and that means it’s the best building block for applications that need to use inference. It’s like how Docker made it so any application can execute something kinda portably kinda safely on any machine. With Ollama, any application can run LLM inference on any machine.

Since that post, we shipped experimental support in our product for Ollama-based local inference. We had to write our own client in TypeScript but will probably be able to switch to this instead.

5 comments

Could you maybe compare it to llama.cpp?

All it took for me to get going is `make` and I basically have it working locally as a console app.

Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.

Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.

Similar to Dockerfiles, ollama offers Modelfiles that you can use to tweak the existing library of models (the parameters and such), or import gguf files directly if you find a model that isn’t in the library.

Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.

Running “make” on llama.cpp is really only the first step. It’s not comparable.

This is interesting. I wouldn't have given the project a deeper look without this information. The lander is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."
I had similar feelings but last week finally tried it in WSL2.

Literally two shell commands and a largish download later I was chatting with mixtral on an aging 1070 at a positively surprising tokens/s (almost reading speed, kinda like the first chatgpt). Felt like magic.

For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.

vLLM doesn't support quantized models at this time so you need 2x 4090 to run Mixtral.

llama.cpp supports quantized models so that makes sense, ollama must have picked a quantized model to make it fit?

Eh? The docs say vLLM supports both gptq and awq quantization. Not that it matters now I'm out of the gate, it just surprised me that it didn't work.

I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. Genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point - it's about 20 tokens per second but the latency is annoying.

For me the big deal with Ollama is the ease of instantly setting up a local inference API. I've got a beefy machine with a GPU downstairs, but Ollama allows me to easily use it from a Raspberry Pi on the main floor.
In my experience award for easiest to run locally will go to llamafile models https://github.com/Mozilla-Ocho/llamafile.
Also one feature request - if the library (or another related library) could also transparently spin up a local Ollama instance if the user doesn’t have one already. “Transparent-on-demand-Ollama” or something.
I have been working on something similar to that in Msty [1]. I haven’t announced the app anywhere (including my friends as I got a few things in pipeline that I want to get out first :)

[1]: https://msty.app

That gets into process management which can get dicey, but I agree, a "daemonless" mode could be really interesting
I'd like to see a comparison to nitro https://github.com/janhq/nitro which has been fantastic for running a local LLM.
> Ollama is definitely the easiest way to run LLMs locally

Nitro outstripped them, 3 MB executable with OpenAI HTTP server and persistent model load

Persistent model loading will be possible with: https://github.com/ollama/ollama/pull/2146 – sorry it isn't yet! More to come on filesize and API improvements
I just wanted to say thank you for being communicative and approachable and nice.
Who cares about executable size when the models are measured in gigabytes lol. I would prefer a Go/Node/Python/etc server for a HTTP service even at 10x the size over some guy's bespoke c++ any day of the week. Also, measuring the size of an executable after zipping is a nonsense benchmark in of itself
Not some guy, agree on zip, disagree entirely with tone of the comment (what exactly separates ollama from those same exact hyperbolic descriptions?)