by brucethemoose2 987 days ago
llama.cpp (and derivative projects) is quickly becoming SOTA for many use cases, and it basically has zero dependencies.

Kobold.cpp, for example, provides an entire web UI and API with Python and 3 Python packages (numpy, sentencepiece, and gguf, which is the llama.cpp library). The LLM itself is a single file you can get with curl or whatever. It takes less than a minute to compile against the native CPU/accelerator architecture, with nothing but the GPU libs themselves, which nets better performance than a generic binary distribution.

...It's not "one line" I guess, but I can hardly imagine a simpler setup. It doesn't really need Docker or a fancy container.

2 comments

Thanks - we definitely agree that llama.cpp is great. Big fan of their optimizations. We are more or less orthogonal to the engines though, in the sense that we serve as the infra/platform to run and manage those implementations easily. We also support running a wider range of models. For example, SDXL is one single line too:

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

It's really about how to productize a wide range of models as easily as possible.

SDXL is indeed a monster to install and set up. The UIs are even worse.

IDK if the GPL license is compatible with your business, but I wonder if you could package Fooocus or Fooocus-MRE into a window? It's a hairy monster to install and run, but I've never gotten such consistently amazing results from a single prompt box + style dropdown box (including native HF diffusers and other diffusers-based frontends). The automatic augmentations to the SDXL pipeline are amazing:

https://github.com/MoonRide303/Fooocus-MRE

Oh wow yeah, that is a beast. Let me give it a shot.
Lepton is at a different layer compared to llama.cpp. In fact, for LLM model files in GGUF format, it uses llama.cpp (ctransformers, to be precise) as the execution engine.
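
To make the "different layer" point concrete, here is a rough sketch of how a platform could tell a GGUF file apart from other model files before handing it to a llama.cpp-based engine. This is not Lepton's actual code: `gguf_version` and `pick_engine` are hypothetical names, and the engine labels are placeholders. The only fact it relies on is that a GGUF file starts with the 4-byte magic `GGUF` followed by a uint32 version, which is assumed to be little-endian here.

```python
import struct
from typing import Optional

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def gguf_version(header: bytes) -> Optional[int]:
    """Return the GGUF format version, or None if these bytes are not a GGUF header."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by a uint32 version (little-endian assumed).
    (version,) = struct.unpack("<I", header[4:8])
    return version


def pick_engine(path: str) -> str:
    """Hypothetical dispatch: label the engine a platform might choose for this file."""
    with open(path, "rb") as f:
        header = f.read(8)
    # Placeholder labels, not real Lepton configuration values.
    return "llama.cpp" if gguf_version(header) is not None else "other"
```

Calling `pick_engine` on a downloaded model file would route anything with a GGUF header to the llama.cpp path and leave everything else (SDXL checkpoints, for instance) to some other runner.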