by sherlockxu 345 days ago

Hi everyone. I'm one of the maintainers of this project. We're both excited and humbled to see it on Hacker News!

We created this handbook to make LLM inference concepts more accessible, especially for developers building real-world LLM applications. The goal is to pull together scattered knowledge into something clear, practical, and easy to build on.

We’re continuing to improve it, so feedback is very welcome!

GitHub repo: https://github.com/bentoml/llm-inference-in-production

4 comments

DiabloD3 344 days ago

I'm not going to open an issue on this, but you should consider expanding on the self-hosting part of the handbook and explicitly recommend llama.cpp for local self-hosted inference.

leopoldj 344 days ago

The self-hosting section covers corporate use cases with vLLM and SGLang, as well as personal desktop use with Ollama, which is a wrapper over llama.cpp.

DiabloD3 344 days ago

Recommending Ollama isn't useful for end users; it's just a trap in a nice-looking wrapper.

nl 344 days ago

Strong disagree on this. Ollama is great for moderately technical users who aren't really programmers or proficient with the command line.

DiabloD3 344 days ago

You can disagree all you want, but Ollama does not keep its vendored copy of llama.cpp up to date. It also ships, via its mirror, badly labeled models that claim to be the real upstream ones and are often misappropriated from major community members (Unsloth, et al.).

When you get a model offered by Ollama's service, you have no clue what you're getting, and normal people who have no experience aren't even aware of this.

Ollama is an unrestricted footgun because of this.

nl 344 days ago

I thought the models were like HuggingFace, where anyone can upload a model and you choose which one you pull. The Unsloth ones look like this to me, e.g.: https://ollama.com/secfa/DeepSeek-R1-UD-IQ1_S

ChromaticPanic 343 days ago

Not the footgun you think it is. Ollama comes with a few things that make it convenient for casual users.

criemen 344 days ago

Thanks a lot for putting this together!

I have a question. In https://github.com/bentoml/llm-inference-in-production/blob/..., you have a single picture that defines TTFT and ITL. That does not match my understanding (but you probably know more than I do): in the graphic, it looks like the model generates 4 tokens, T0 to T3, before outputting a single output token.

I'd have expected that picture for ITL (except that the labeling of the last box is then off), but for TTFT, I'd have expected only a single token T0 from the decode step, which is then immediately handed to detokenization and arrives as the first output token (assuming a streaming setup; otherwise measuring TTFT makes little sense).
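
To make that reading concrete, here is a minimal sketch of how the two metrics could be computed from token arrival times in a streaming setup. It assumes TTFT is the delay from sending the request to receiving the first output token, and ITL is the gap between consecutive output tokens. The function name `ttft_and_itl` and its parameters are hypothetical, not taken from the handbook.

```python
from typing import List, Sequence, Tuple


def ttft_and_itl(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, List[float]]:
    """Compute TTFT and per-token ITL from timestamps in seconds.

    request_sent: time the request was sent (hypothetical input).
    token_times:  arrival time of each streamed output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")

    # TTFT: only the first token matters, so it covers prefill plus
    # a single decode step and detokenization of T0.
    ttft = token_times[0] - request_sent

    # ITL: gap between each pair of consecutive output tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


if __name__ == "__main__":
    ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.5])
    print(f"TTFT: {ttft:.2f}s, mean ITL: {sum(itl) / len(itl):.2f}s")
```

With a single output token the ITL list is empty, which matches the point that TTFT needs only T0 to arrive.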

sherlockxu 342 days ago

Thanks. We have updated the image to make it more accurate.

armcat 345 days ago

Amazing work on this, beautifully put together and very useful!

sethherr 344 days ago

This seems useful and well put together, but splitting it into many small pages instead of a single page that can be scrolled through is frustrating - particularly on mobile where the table of contents isn't shown by default. I stopped reading after a few pages because it annoyed me.

At the very least, the sections should be a single page each.
