| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sqs 883 days ago
	I posted about my awesome experiences using Ollama a few months ago: https://news.ycombinator.com/item?id=37662915. Ollama is definitely the easiest way to run LLMs locally, and that means it’s the best building block for applications that need to use inference. It’s like how Docker made it so any application can execute something kinda portably kinda safely on any machine. With Ollama, any application can run LLM inference on any machine. Since that post, we shipped experimental support in our product for Ollama-based local inference. We had to write our own client in TypeScript but will probably be able to switch to this instead.

5 comments

keyle 883 days ago

Could you maybe compare it to llama.cpp?

All it took for me to get going is `make` and I basically have it working locally as a console app.

link

coder543 883 days ago

Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.

Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.

Similar to Dockerfiles, ollama offers Modelfiles that you can use to tweak the existing library of models (the parameters and such), or import gguf files directly if you find a model that isn’t in the library.

Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.

Running “make” on llama.cpp is really only the first step. It’s not comparable.

link

palmfacehn 883 days ago

This is interesting. I wouldn't have given the project a deeper look without this information. The lander is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."

link

baq 883 days ago

I had similar feelings but last week finally tried it in WSL2.

Literally two shell commands and a largish download later I was chatting with mixtral on an aging 1070 at a positively surprising tokens/s (almost reading speed, kinda like the first chatgpt). Felt like magic.

link

regularfry 883 days ago

For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.

link

foxhop 882 days ago

vLLM doesn't support quantized models at this time so you need 2x 4090 to run Mixtral.

llama.cpp supports quantized models so that makes sense, ollama must have picked a quantized model to make it fit?

link

regularfry 882 days ago

Eh? The docs say vLLM supports both gptq and awq quantization. Not that it matters now I'm out of the gate, it just surprised me that it didn't work.

I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. Genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point - it's about 20 tokens per second but the latency is annoying.

link

lolinder 883 days ago

For me the big deal with Ollama is the ease of instantly setting up a local inference API. I've got a beefy machine with a GPU downstairs, but Ollama allows me to easily use it from a Raspberry Pi on the main floor.

link

acd10j 883 days ago

In my experience award for easiest to run locally will go to llamafile models https://github.com/Mozilla-Ocho/llamafile.

link

sqs 883 days ago

Also one feature request - if the library (or another related library) could also transparently spin up a local Ollama instance if the user doesn’t have one already. “Transparent-on-demand-Ollama” or something.

link

chown 883 days ago

I have been working on something similar to that in Msty [1]. I haven’t announced the app anywhere (including my friends as I got a few things in pipeline that I want to get out first :)

[1]: https://msty.app

link

zenlikethat 883 days ago

That gets into process management which can get dicey, but I agree, a "daemonless" mode could be really interesting

link

donpdonp 883 days ago

I'd like to see a comparison to nitro https://github.com/janhq/nitro which has been fantastic for running a local LLM.

link

refulgentis 883 days ago

> Ollama is definitely the easiest way to run LLMs locally

Nitro outstripped them, 3 MB executable with OpenAI HTTP server and persistent model load

link

jmorgan 883 days ago

Persistent model loading will be possible with: https://github.com/ollama/ollama/pull/2146 – sorry it isn't yet! More to come on filesize and API improvements

link

akulbe 883 days ago

I just wanted to say thank you for being communicative and approachable and nice.

link

evantbyrne 883 days ago

Who cares about executable size when the models are measured in gigabytes lol. I would prefer a Go/Node/Python/etc server for a HTTP service even at 10x the size over some guy's bespoke c++ any day of the week. Also, measuring the size of an executable after zipping is a nonsense benchmark in of itself

link

refulgentis 882 days ago

Not some guy, agree on zip, disagree entirely with tone of the comment (what exactly separates ollama from those same exact hyperbolic descriptions?)

link