I think the llm utility[0] (the one from Simon Willison, not Google) is probably the best quickstart experience you can find. It gives you the option to connect to hosted services via API or to install and run local models.
As simple as
pip install llm
# add the local plugin
llm install llm-gpt4all
# Download and run a prompt against the Orca Mini 3B model
llm -m orca-mini-3b-gguf2-q4_0 'What is the capital of France?'
Alternatively, you could use llamafile[1], a tiny binary runner that gets packaged on top of the multi-gigabyte model files. Download the llamafile and you can run it from your terminal or through a web browser.
From the llamafile page, after you download the file and make it executable (chmod +x on Linux/macOS), you can just launch it as
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 9999 --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
Follow the guide all the way until you get to "Loading our model in Oobabooga". Then ignore the rest. You can do inference in Ooba under the Notebook tab.
(You can also ignore the "enabling HTTP API" parts, but it's quite handy, it's an OpenAI-compatible API which means you can use any OpenAI-compatible web UI)
The other alternatives cover only a small fraction of the features that oobabooga supports, and only a fraction of the LLM backends it can run.
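If you do enable the OpenAI-compatible API mentioned above, talking to it from a script is just an HTTP POST. Here is a minimal Python sketch using only the standard library. The base URL, port, and model name are placeholders made up for illustration, so check your server's startup output for the real address:

```python
import json
import urllib.request

# Assumption: address of a locally running OpenAI-compatible server.
# Host, port and model name are placeholders, not values from this thread.
API_BASE = "http://127.0.0.1:5000/v1"


def build_chat_request(prompt, model="local-model", temperature=0.7, api_base=API_BASE):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{api_base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running and its API enabled, something like print(ask("Write a story about llamas")) should return the model's reply. Because the request shape is the standard OpenAI one, the same code should work against any other OpenAI-compatible server by changing API_BASE.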
If you don't care about the details of how those model servers work, then something that abstracts out the whole process like LM Studio or Ollama is all you need.
However, if you want to get into the weeds of how this actually works, I recommend you look up model quantization and some libraries like ggml[1] that actually do that for you.
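To give a rough feel for what quantization means before you dive in: the idea is to store each weight in a few bits plus one shared scale per block, trading a little precision for a lot of memory. This toy Python sketch is not ggml's actual format or API, just an illustration, and the function names are made up:

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to small signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round each weight to the nearest step and clamp to the representable range.
    quants = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quants]


def approx_size_gb(n_params, bits_per_weight):
    """Rough size of the weights alone, ignoring per-block scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


scale, quants = quantize_block([3.5, -1.5, 0.0, 0.5])
restored = dequantize_block(scale, quants)
```

The rounding error per weight is at most half a step (scale / 2), which is why quantized models lose a bit of quality but still work. The size helper shows the payoff: 7 billion parameters at 16 bits per weight is about 14 GB, while at 4 bits it is about 3.5 GB before overhead.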
It's all pretty well put together nowadays, honestly.
Here's a dead simple way: (1) download LM Studio and install it[0], (2) download a model from within the client when prompted, (3) have a ball.
The program is fairly intuitive, it takes care of finding the relevant files, and it can even accept addendum prompts and various ways to flavor or specialize answers.
Learn the basics there, take what you learn to a more 'industrial' playground later on.
[0] https://llm.datasette.io/en/stable/index.html
[1] https://github.com/Mozilla-Ocho/llamafile
Edit: added llm quickstart from the intro page