by everforward 769 days ago
You run most of these models in something that wraps them in an HTTP API. I use Ollama, which I think is the most popular, though I’m not in a great position to judge. My impression is that it handles running models on CPU better than the alternatives.

So you’d basically install Ollama, download one of the versions of this model off HuggingFace, create a Modelfile since this isn’t in the default Ollama repo, and then Ollama can answer prompts with the model. Modelfiles are very simple, based on Dockerfiles. It takes like 15 seconds to make one if you aren’t messing with the various parameters.
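To give a sense of how small a Modelfile can be, here is a rough sketch that writes one from Python. The only required line is `FROM`, pointing at the GGUF file you downloaded; the optional `PARAMETER temperature` line is one of the tunables you can skip. The helper name `write_modelfile` and the file name `granite-code.gguf` are placeholders for illustration, not anything from this thread.

```python
from pathlib import Path
from typing import Optional


def write_modelfile(gguf_path: str, out_dir: str = ".",
                    temperature: Optional[float] = None) -> Path:
    """Write a minimal Ollama Modelfile that points at a local GGUF file.

    gguf_path is whatever file you downloaded (placeholder name here).
    """
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        # Optional: override a sampling parameter instead of using defaults.
        lines.append(f"PARAMETER temperature {temperature}")
    target = Path(out_dir) / "Modelfile"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

After calling something like `write_modelfile("./granite-code.gguf")`, running `ollama create granite-local -f Modelfile` should register the model under the (made-up) name `granite-local`, and `ollama run granite-local` should start it.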

Once it’s in Ollama, just get one of the various GPT plugins for VSCode and give it the Ollama URL (http://localhost:11434 by default). I use continue.dev but there are many.
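If you want to see what the plugin is talking to, you can hit that URL yourself. Below is a minimal sketch using only the standard library against Ollama's `/api/generate` endpoint. The model name `llama3` is just a placeholder for whatever you have loaded, and the function names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming request to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a Python function that reverses a string")` should return the model's answer as a plain string.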

Continue takes over the tab autocomplete with the LLM, and has a chat window on the right where you can use keyboard shortcuts to copy code into the prompt and ask it to edit/generate code or ask questions about existing code.


If you can compile stuff, then llama.cpp (what Ollama uses) is also worth a look: https://github.com/ggerganov/llama.cpp

The server is here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

And you can search for any GGUF on HuggingFace.

Thank you so much! That sounds surprisingly straightforward. I expected a lot more fiddling to get going.

Where would I start if I wanted to use a model programmatically? Like, let's say I am building a chat bot. I have a large data set of replies I want the model to mimic, and I'd want to do this in Python. Of course, I'd probably use a different model than Granite.

This is stretching my own knowledge, so if someone else knowledgeable wants to take a stab here I would appreciate a response as well!

Before doing that, I would start basic. Pull llama3 and see what it does with your prompts. You may be surprised how much is already in there, and you may not need to involve your own data at all. If that doesn’t work, check HuggingFace to see if someone has already made a model/fine-tune/LoRA for what you’re trying to do. There are many; e.g., I found a Magic: The Gathering rules model the other day.
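For the Python side of a chat bot, a rough sketch of that "start basic" step could look like the following. It assumes you have already run `ollama pull llama3` and that Ollama is listening on its default port; it uses the `/api/chat` endpoint, which takes the whole message history on each call. The helper names are mine, not part of any library.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default port


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


def chat(history, model="llama3"):
    """Send the full history to Ollama and return it with the reply appended."""
    payload = {"model": model, "messages": history, "stream": False}
    req = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["message"]["content"]
    return add_turn(history, "assistant", reply)
```

A bot loop would then be: start with `add_turn([], "system", "...")` describing the tone you want, append each user message with `add_turn(history, "user", text)`, and call `chat(history)` to get the updated history back.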

If those fail, or you just want to play with your own data, you’ll need to figure out what “mimic” means.

If the model does okay with generating content but the content is factually wrong or missing background, you may be able to just do RAG (retrieval augmented generation). Basically, you run your documents through a model that converts them to embeddings (some kind of vector; I don’t understand how they work). Then when you run a query, you search for related embeddings and pass the matching text to the model so that it “knows” the content that was in the document. This is the easiest option; open-webui (the Ollama web chat interface) has some RAG support. Danswer is open source and built from the ground up to do RAG, and has built-in support for ingesting from Slack, Drive, etc. OpenAI also offers embeddings as a service.
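To make the retrieve-then-prompt flow concrete, here is a toy sketch. Big caveat: `embed` below is a bag-of-words counter I made up as a stand-in, not a real embedding model. In a real setup you would swap in vectors from an actual embedding model, but the search-then-stuff-into-the-prompt shape stays the same.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents whose vectors are closest to the query's."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query, docs, k=1):
    """Put the retrieved text in front of the question for the model."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Use this context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```

The string from `build_prompt` is what you would send to the model in place of the bare question.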

A step up from that is making a LoRA. To my novice eyes, LoRAs are basically a diff of the model’s parameters or weights. So rather than training a whole new model, you just add deltas to an existing one. These let you “teach” the model something while preserving the base generation capabilities of the underlying model. I.e., you won’t have to worry about making sure you feed it enough data that it can speak English properly, because it gets that from the base model; you only have to give it enough data to speak about whatever you’re training it on.
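The "diff" idea can be sketched in a few lines. LoRA stands for low-rank adaptation: the delta for a weight matrix is stored as the product of two much smaller matrices, so the adapter is tiny compared to the model. The code below is a plain-Python illustration of that arithmetic only, with made-up function names; it is not how any particular training library implements it.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def apply_lora(base, b, a, scale=1.0):
    """Return base + scale * (b @ a): the frozen weights plus the learned delta."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def param_counts(d, k, r):
    """Parameters in a full d x k matrix vs. in a rank-r adapter for it."""
    return d * k, r * (d + k)
```

For example, `param_counts(4096, 4096, 8)` gives 16,777,216 weights in the full matrix versus 65,536 in the adapter, which is why a LoRA file is so much smaller than the model it modifies.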

If that doesn’t make any sense, go check CivitAI for Stable Diffusion (image model) LoRAs. The effects are way more obvious on image AIs.

Anyway, LoRAs have to be trained, so at that point you’re into training. I think HuggingFace has tools that make this easy, but I don’t know enough to say anything with confidence.

The last option, which you almost certainly don’t want, is to train a new base model like llama3. You’re starting from 0 there; you have no existing model, so you will have to teach it everything. It will take a ton of data, it will take forever to train, and it will likely be much worse than even a model picked at random on HuggingFace. Meta has spent who knows how much on Llama and it still hallucinates.

If you end up training, you’ll probably end up doing it in the cloud unless you have tons of VRAM doing nothing. Prices are pretty reasonable; I think A100s are around $2/hr. I don’t know how to gauge how long it needs to train, but I believe it’s related to the amount of data you’re training on. I believe it’s pretty reasonable for LoRAs, though; I’m guesstimating somewhere in the $20-ish range.
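The back-of-the-envelope math behind that guess is just GPU-hours times the hourly rate. A tiny sketch, using the roughly $2/hr figure from above as the default (an estimate, not a quoted price):

```python
def training_cost(hours, gpus=1, rate_per_gpu_hour=2.0):
    """Estimated rental cost in dollars for a training run."""
    return hours * gpus * rate_per_gpu_hour


def hours_for_budget(budget, gpus=1, rate_per_gpu_hour=2.0):
    """How many wall-clock hours a given budget buys."""
    return budget / (gpus * rate_per_gpu_hour)
```

So a $20 budget at that rate works out to about 10 hours on a single GPU, or 5 hours on two.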

Edit: oh, and I’m not affiliated in any way, but I found out last night that Fireworks’ new function calling model is free while it’s in beta, which is a neat/fun thing to play with: https://fireworks.ai/blog/firefunction-v1-gpt-4-level-functi... It’s also open weights if you want to run it locally, but it’s a 40B model, so I can’t on my 3060.

Thank you again! This is definitely something to start from!