by littlestymaar 777 days ago
This is very cool, it's something I've wished existed since Llama came out. Having to install Ollama + CUDA to get a locally working LLM didn't feel right to me when everything that's needed is already in the browser. Llamafile solves the first half of the problem, but you still need to install CUDA/ROCm for it to work with GPU acceleration. WebGPU is the way to go if we want to put AI on consumer hardware and break the oligopoly; I just wish it were more broadly available (on Linux, no browser supports it yet).
3 comments

> having to install Ollama + CUDA to get a locally working LLM didn't feel right to me when everything that's needed is already in the browser

Was there something specifically about the install that didn't feel right? I ask because Ollama is just a thin Go wrapper around llama.cpp (it's actually starting a modified version of the llama.cpp server in the background, not even going through the Go FFI, likely for perf reasons). In that sense, you could just install the CUDA toolkit via your package manager and call `make LLAMA_CUDA=1; ./server` from the llama.cpp repo root to get effectively the same thing in two simple steps with no extra overhead.

I'm never gonna have my non-tech friend do any of this when they can just go to chat.openai.com and call it a day.

Most people value convenience at the expense of almost everything else when it comes to technology.

> I'm never gonna have my non-tech friend do any of this

Who was making that assertion? I certainly wasn't.

In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.

> Who was making that assertion? I certainly wasn't.

But you were responding to my comment, and that was the implied part in it (which I later clarified to answer your question).

> In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.

Fortunately for most apps there's a middle ground between “use a spyware” and “build your own”, and that's exactly why this tool is much needed for LLM in my opinion.

> Fortunately for most apps there's a middle ground between “use a spyware” and “build your own”, and that's exactly why this tool is much needed for LLM in my opinion.

Sure, I think I understand the motivation; the big tradeoff is performance. If your original comment about people privileging convenience holds true across the end-to-end user experience here, I would say that single-digit tokens-per-second rates probably qualify as inconvenient for many folks and thus cannibalize whatever ease-of-setup value you get at the outset.

There's a reason CUDA/ROCm is needed for the acceleration: a ton of work has been put into optimization via custom kernels to get the palatable throughput/latency consumers are used to when using frontier model APIs (or GPU-accelerated local stacks).
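To put rough numbers on the convenience argument, here is a back-of-the-envelope sketch in TypeScript. The 500-token reply length and the 5 and 50 tokens/s rates are made-up illustrative values, not measurements of this tool or of any GPU stack, and `secondsToGenerate` is a hypothetical helper name.

```typescript
// Seconds needed to stream a reply of `tokens` tokens at a steady
// `tokensPerSecond` generation rate (ignores prompt processing time).
function secondsToGenerate(tokens: number, tokensPerSecond: number): number {
  if (tokens < 0 || tokensPerSecond <= 0) {
    throw new RangeError("tokens must be >= 0 and tokensPerSecond must be > 0");
  }
  return tokens / tokensPerSecond;
}

// Hypothetical values, chosen only to illustrate the gap.
const replyTokens = 500;
for (const rate of [5, 50]) {
  const seconds = secondsToGenerate(replyTokens, rate);
  console.log(`${rate} tok/s -> ${seconds.toFixed(0)} s for ${replyTokens} tokens`);
}
// prints "5 tok/s -> 100 s for 500 tokens"
// prints "50 tok/s -> 10 s for 500 tokens"
```

Under those assumed numbers, a single-digit rate turns a ten-second wait into well over a minute, which is the kind of gap that could eat the ease-of-setup advantage.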

Tested on Ubuntu 22.04 with Chrome and, sure enough: "Could not load the model because Error: Cannot find adapter that matches the request".

It really is too bad WebGPU isn't supported on Linux, I mean, that's a no-brainer right there.
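For anyone hitting the adapter error quoted above, here is a minimal TypeScript sketch of how a page might tell "this browser exposes no WebGPU at all" apart from "WebGPU is exposed but no adapter came back". `navigator.gpu.requestAdapter()` is the standard WebGPU entry point and resolves to `null` when no suitable adapter is found. The `probeWebGpu` helper, its status strings, and the local `GpuLike`/`AdapterLike` types are made up for illustration, and it is only an assumption that the error in this thread corresponds to the null-adapter case.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
// These are illustrative stand-ins, not the full WebGPU interfaces.
interface AdapterLike {
  readonly features?: unknown;
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

type WebGpuStatus = "unsupported" | "no-adapter" | "ok";

// Classify WebGPU availability for a given `gpu` object.
// In a browser the caller would pass `navigator.gpu`.
async function probeWebGpu(gpu: GpuLike | undefined): Promise<WebGpuStatus> {
  // No `navigator.gpu`: the browser (or its current flags) does not expose WebGPU.
  if (!gpu) return "unsupported";
  // The API exists, but the browser may still fail to find a usable adapter.
  const adapter = await gpu.requestAdapter();
  return adapter ? "ok" : "no-adapter";
}

// Usage: read navigator.gpu defensively so this also runs outside a browser.
const maybeGpu: GpuLike | undefined = (globalThis as any).navigator?.gpu;
probeWebGpu(maybeGpu).then((status) => console.log(`WebGPU status: ${status}`));
```

A result of "unsupported" would point at the browser build or its flags, while "no-adapter" would point at the driver or GPU backend.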

Works for me.

WebGPU support is behind a couple flags on Linux: https://github.com/gpuweb/gpuweb/wiki/Implementation-Status

Awesome, thanks for pointing me here.

I tested with the flags and added the --enable-Vulkan switch, but to no avail. But I have a somewhat non-standard setup, both software and hardware, so I'm not terribly surprised. (Kubuntu 22.04 on an MSI laptop with an NVIDIA 3060, using proprietary non-free/blob driver 535.)

I will be playing with WebGPU in the coming weeks on a number of platforms; it seems like a no-brainer for the current state of AI stuff.

Likewise (same error) with Chrome on Windows.

Currently running Ollama / Open WebUI and finding llama3:8b quite useful for writing snippets of PowerShell, JavaScript, Go, etc.

I get the same thing with Chrome on my last-generation Intel iMac.
I've managed to avoid Ollama and just toyed with LM Studio. It's non-free software, but extremely easy to get into, uses llama.cpp under the hood, cross-platform, yada yada. There's https://jan.ai/docs as well, which is AGPL3 and promises inference as well as training. Doubtless there are many other similar offerings.

I'm wary of any 'web' prefix on what could / should otherwise be desktop applications, mostly due to doubts about browser security.