Hacker News
by spmurrayzzz 770 days ago
> having to install Ollama + CUDA to get a locally working LLM didn't feel right to me when everything that's needed is already in the browser

Was there something specifically about the install that didn't feel right? I ask because ollama is just a thin Go wrapper around llama.cpp (it's actually starting a modified version of the llama.cpp server in the background, not even going through the Go FFI, likely for perf reasons). In that sense, you could just install the CUDA toolkit via your package manager and call `make LLAMA_CUDA=1; ./server` from the llama.cpp repo root to get effectively the same thing in two simple steps with no extra overhead.
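For what it's worth, once `./server` is running you talk to it over plain HTTP. Here is a rough sketch of a client using only the Python standard library. It assumes the server is on its default address (`127.0.0.1:8080`) and uses the `/completion` endpoint with `prompt` and `n_predict`; the helper names `build_completion_request` and `complete` are made up for illustration, and your build may differ.

```python
import json
import urllib.request

# Assumption: the llama.cpp server is listening on its default host and port.
DEFAULT_BASE_URL = "http://127.0.0.1:8080"


def build_completion_request(prompt, n_predict=64, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a POST request for the /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=64, base_url=DEFAULT_BASE_URL):
    """Send the request and return the generated text.

    Hypothetical helper: it only works while a server is actually running.
    """
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The server returns the generated text under the "content" key.
    return body["content"]
```

With the server up, something like `complete("Hello", n_predict=32)` should return the generated text. Building the request separately from sending it just makes the payload easy to inspect without a model loaded.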


I'm never gonna have my non-tech friend do any of this when they can just go to chat.openai.com and call it a day.

Most people value convenience at the expense of almost everything else when it comes to technology.

> I'm never gonna have my non-tech friend do any of this

Who was making that assertion? I certainly wasn't.

In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.

> Who was making that assertion? I certainly wasn't.

But you were responding to my comment, and that was the implied part in it (which I later clarified to answer your question).

> In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.

Fortunately, for most apps there's a middle ground between “use spyware” and “build your own”, and that's exactly why this tool is much needed for LLMs in my opinion.

> Fortunately, for most apps there's a middle ground between “use spyware” and “build your own”, and that's exactly why this tool is much needed for LLMs in my opinion.

Sure, I think I understand the motivation; the big tradeoff is performance. If your original point about people privileging convenience holds true across the end-to-end user experience here, I would say that single-digit tokens-per-second rates probably qualify as inconvenient for many folks and thus cannibalize whatever ease-of-setup value you get at the outset.

There's a reason CUDA/ROCm is needed for the acceleration: a ton of work has gone into optimization via custom kernels to get the palatable throughput/latency consumers are used to when using frontier model APIs (or GPU-accelerated local stacks).
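To put the tokens-per-second point in rough numbers, here is a back-of-the-envelope sketch. The reply length and both rates are made-up illustrative values, not measurements of any particular tool or GPU.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to stream a reply of num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 500

# Hypothetical decode rates (tokens per second); not benchmarks.
RATES = {
    "single-digit rate": 5,
    "accelerated rate": 50,
}

for label, rate in RATES.items():
    wait = seconds_to_generate(REPLY_TOKENS, rate)
    print(f"{label}: {wait:.0f} s for a {REPLY_TOKENS}-token reply")
```

Under those assumed numbers the wait differs by the same factor as the rates do, which is the kind of gap that can eat the convenience you gained at setup.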