by manca 782 days ago
A lot of the answers to your question focus solely on the infra piece of the deployment process, which is just one, albeit important, piece of the puzzle.

Each model is built using some predefined model architecture, and the majority of today's LLMs are implementations of the Transformer architecture, based on the "Attention Is All You Need" paper from 2017. That said, when you fine-tune a model, you usually start from a checkpoint and then compute new weights using techniques like LoRA or QLoRA. You do this in your training/fine-tuning script using PyTorch or some other framework.
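
To make the "compute new weights" step concrete, here is a toy sketch of the LoRA idea in plain Python. The checkpoint matrix W stays frozen and training only learns two small matrices, A and B, whose scaled product is added on top. The shapes, the values, and the variable names are illustrative assumptions; a real fine-tuning script would use PyTorch and a LoRA library rather than nested lists. QLoRA applies the same kind of update on top of a quantized base model.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), i.e. the merged fine-tuned weights."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen checkpoint weights: 4 x 4 = 16 parameters (toy values).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Trainable low-rank factors with rank r = 1:
# B is 4 x 1 and A is 1 x 4, so only 8 parameters are trained instead of 16.
r, alpha = 1, 2.0
B = [[0.5], [0.0], [0.0], [0.25]]
A = [[1.0, 0.0, 0.0, 2.0]]

W_finetuned = lora_merge(W, A, B, alpha, r)
```

The saving is small here, but for a large square matrix the two factors are far smaller than the matrix they adjust, which is what makes this style of fine-tuning cheap.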

Once the training is done you get the final weights -- a binary blob of floats. Now you need to load those weights back into the inference architecture of the model. You do that by using the same framework you trained with (PyTorch, for example) to construct the inference pipeline. You can build your own framework/inference engine too if you want and try to beat PyTorch :) The pipeline will consist of things like:

- loading the model weights

- doing pre-processing on your input

- building the inference graph

- running your input (embeddings/vectors) through the graph

- generating predictions/results
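
The five steps above can be sketched end to end with nothing but the standard library. This is a deliberately tiny stand-in, not how PyTorch does it: the "checkpoint" is six packed floats, the vocabulary and labels are made up, and the "graph" is just a sum followed by a softmax. All names here are hypothetical.

```python
import math
import struct

# Stand-in for a trained checkpoint: 6 float32 values in a binary blob.
# Toy assumption: 3 vocabulary entries x 2 output classes.
CHECKPOINT = struct.pack("<6f", 2.0, -1.0, -1.0, 2.0, 0.5, 0.5)
VOCAB = {"good": 0, "bad": 1, "movie": 2}   # hypothetical tokenizer vocabulary
LABELS = ["positive", "negative"]


def load_weights(blob, rows, cols):
    """Step 1: turn the binary blob of floats back into a weight matrix."""
    flat = struct.unpack(f"<{rows * cols}f", blob)
    return [list(flat[i * cols:(i + 1) * cols]) for i in range(rows)]


def preprocess(text):
    """Step 2: map raw text to token ids, dropping unknown words."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]


def build_graph(weights):
    """Step 3: wire up the computation (sum the token rows, then softmax)."""
    def forward(token_ids):
        logits = [sum(weights[t][c] for t in token_ids) for c in range(len(LABELS))]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    return forward


def predict(text, forward):
    """Steps 4 and 5: run the input through the graph and pick a label."""
    probs = forward(preprocess(text))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


forward = build_graph(load_weights(CHECKPOINT, rows=3, cols=2))
label, probs = predict("good movie", forward)
```

A real pipeline has the same shape, with a tokenizer in place of `preprocess` and a Transformer in place of `forward`.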

Now, the execution of this pipeline can be done on GPU(s), so that all the computations (mostly matrix multiplications) are fast and the results come back quickly, or it can still run on good old CPUs, just much more slowly. Tricks like quantization of the model weights can be used here to reduce the model size and speed up execution, trading off some precision/recall.
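
As a rough illustration of that trade-off, here is a minimal sketch of symmetric int8 quantization in plain Python. It is one simple scheme chosen for clarity, not the method any particular engine uses, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    largest = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero
    scale = largest / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]


weights = [0.12, -0.98, 0.45, 0.003, -0.27, 0.76]   # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * 4
int8_bytes = len(weights) * 1 + 4   # one byte per weight plus one float32 scale
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight shrinks from four bytes to one, and the price is a rounding error of at most half a quantization step per weight.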

Tools like Ollama or vLLM abstract away all of the above steps, which is why they are so popular -- they might even allow you to bring your own (fine-tuned) model.

On top of the pure model execution, you can create a web service that serves your model via an HTTP or gRPC endpoint. It could accept a user query/input and return JSON with the results. Then it can be incorporated into any application, or become part of another service, etc.
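
A minimal sketch of such an HTTP endpoint, using only Python's standard library, might look like the following. `run_model` is a hypothetical placeholder for real inference, and the `{"query": ...}` request shape, the host, and the port are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_model(query):
    """Hypothetical placeholder for real inference: just reports a word count."""
    return {"query": query, "num_tokens": len(query.split())}


def handle_request_body(body):
    """Parse a JSON request body and return (status_code, response_dict)."""
    try:
        query = json.loads(body)["query"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'query' field"}
    if not isinstance(query, str):
        return 400, {"error": "'query' must be a string"}
    return 200, {"result": run_model(query)}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, run inference, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request_body(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(host="127.0.0.1", port=8000):
    """Create the server; call .serve_forever() on the result to start serving."""
    return HTTPServer((host, port), InferenceHandler)
```

Keeping the request handling in a plain function makes it easy to test without opening a socket, and the same function could sit behind a gRPC handler instead.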

So, the answer is much more than "get the GPU and run with it", and I think it's important to be aware of all the steps required if you want to really understand what goes into deploying custom ML models and putting them to good use.


Thanks for the insightful response. This is exactly the type of answer I was looking for. What's the best way to educate myself on the end-to-end process of deploying a production-grade model in a smart, cost-efficient manner?

This might be asking for too much, but is there a guide that explains each part of this process? Your comment made the high-level picture much clearer for me, and I'd like to get into the weeds a bit on each of these.

download llama.cpp

convert the fine-tuned model into gguf format. choose a number of quantization bits such that the final gguf will fit in your free ram + vram
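
a back-of-the-envelope way to pick the bit width is parameters x bits / 8 bytes. the sketch below is only that rough estimate: real gguf quantization formats store extra scale data, so files come out somewhat larger than the nominal bit width suggests, and the 2 GB of headroom for context and runtime buffers is an arbitrary assumption. the function names are made up for this example.

```python
def estimated_size_gb(num_params, bits_per_weight):
    """Rough weight size: parameters x bits / 8 bytes, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def largest_fitting_bits(num_params, free_ram_gb, free_vram_gb,
                         candidates=(16, 8, 6, 5, 4, 3, 2), headroom_gb=2.0):
    """Return the highest bit width whose estimate fits, or None if none does."""
    budget = free_ram_gb + free_vram_gb - headroom_gb
    for bits in sorted(candidates, reverse=True):
        if estimated_size_gb(num_params, bits) <= budget:
            return bits
    return None


# Example: a 70-billion-parameter model, 32 GB of free RAM, 24 GB of free VRAM.
bits = largest_fitting_bits(70e9, free_ram_gb=32, free_vram_gb=24)
```

with these numbers the budget is 54 GB: 8 bits would need about 70 GB, while 6 bits needs about 52.5 GB, so 6 is the largest candidate that fits.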

run the llama.cpp server binary. choose the -ngl value (the number of model layers offloaded to the gpu) as the max number that will not overflow your vram (i just determine it experimentally: i start with the full number of layers, divide by two if it runs out of vram, multiply by 1.5 if there is enough vram, etc)
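
the halve-then-grow routine described above is essentially a search for the largest layer count that still fits. here is a sketch that swaps in a plain binary search for it. `fits_in_vram` is a hypothetical callback you would supply yourself (for example, start the server with that many layers and see whether it loads); the simulated version below just uses made-up memory numbers.

```python
def max_gpu_layers(total_layers, fits_in_vram):
    """Binary-search the largest layer count in [0, total_layers] that fits.

    Assumes that if n layers fit in VRAM, every smaller count fits too.
    """
    if fits_in_vram(total_layers):
        return total_layers            # first try: everything on the GPU
    lo, hi = 0, total_layers           # invariant: lo fits (0 = CPU only), hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits_in_vram(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Simulated check with hypothetical numbers: 400 MB per layer, 10000 MB of VRAM.
def simulated_fits(n, mb_per_layer=400, vram_mb=10000):
    return n * mb_per_layer <= vram_mb


best = max_gpu_layers(80, simulated_fits)
```

for an 80-layer model this takes a handful of trial launches instead of guessing, and it ends on the exact boundary rather than somewhere near it.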

make sure to set the temperature to 0 if you are doing fact-based language conversion and not creative tasks
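
the reason temperature 0 helps: by the usual convention it means always taking the highest-scoring token, so the same input gives the same output. here is a small sketch of that convention in plain python; the logits are made up and this is not llama.cpp's actual sampler.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, fully deterministic.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.1]   # hypothetical scores for three candidate tokens
rng = random.Random(0)     # seeded only so the example is repeatable

greedy = [sample_token(logits, 0) for _ in range(5)]
creative = [sample_token(logits, 1.0, rng) for _ in range(5)]
```

`greedy` is token 0 every time, while `creative` can mix in the lower-scoring tokens, which is what you want for creative writing and not for fact-based conversion.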

if it's too slow, get more vram

ollama, kobold.cpp, and just running the model yourself with a python script as described by the original commenter are also options, but the above is what i have been enjoying lately.

everyone else in this thread is saying you need gpus but this really isn't true. what you need is ram. if you are trying to get a model that can reason you really want the biggest model possible. the more ram you have the less quantized you have to make your production model. if you can batch your requests and get the result a day later, you just need as much ram as you can get and it doesn't matter how many tokens per second you get. if you are doing creative generation then this doesn't matter nearly as much. if you need realtime then it gets extremely expensive fast to get enough vram to host your whole model (assuming you want as large a model as possible for better reasoning capability)

Interesting. Thanks for the response. Do you have any resources where I can educate myself about this? How did you learn what you know about LLMs?

Well, when Llama 1 came out I signed up and downloaded it, and that led me to llama.cpp. I followed the instructions to quantize the model to fit in my graphics card. Then later when more models like llama2 and mixtral came out I would download and evaluate them.

I kept up on hacker news posts and any comments about things I didn't understand. I've also found the localllama subreddit to be a great way to learn.

Any time I saw a comment on anything I would try it, like ollama, kobold.cpp, sillytavern, textgen-webui, and more.

I also have a friend who has been into ai for many years and we always exchange links to new things. I developed a retrieval augmented generation (rag) app with him and a "transformation engine" pipeline.

So following ai stories on hn and reddit, learning through doing, and applying what I learned to real projects.

Thanks. Very cool. Have you ever tried to implement a transformer from scratch, like in the "Attention Is All You Need" paper? Can a first/second year college student do it?

Hi, I work at a startup where we train / fine tune / inference models on a gcp kubernetes cluster on some a100s.

There isn't really that much information about how to do this properly because everyone is still working it out and it changes month by month. It requires a bunch of DevOps and infrastructure knowledge above and beyond the raw ML knowledge.

Your best bet is probably just to tool around and see what you can do.

Thanks!! This is really cool