Is this chasing the impossible? I don't mean to criticize, and I love the effort.
But is it really possible, even a little, to run an LLM on a single machine?
I want to believe :)
GPTQ has been the missing piece: it allows quantizing the model weights from 16 bits to 4 bits with only a small loss in quality.
That in turn allows running even the large 65 billion parameter version of the LLaMA model in ~33GB of RAM or VRAM.
With VRAM, that requires two 24GB GPUs, which is no longer completely out of reach.
The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.
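The numbers above follow from simple arithmetic, which can be sketched roughly as below. This is only a back-of-the-envelope estimate: the function names are made up for illustration, it counts the weights alone, and it ignores activations, the KV cache, and the extra metadata a quantized format stores, so real usage will be somewhat higher.

```python
import math

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (activations, KV cache, quantization scales) is counted.
    """
    return n_params * bits_per_weight / 8 / 1e9

def gpus_needed(memory_gb: float, vram_per_gpu_gb: float = 24.0) -> int:
    """Minimum number of identical GPUs whose combined VRAM holds memory_gb."""
    return math.ceil(memory_gb / vram_per_gpu_gb)

if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("65B", 65e9)]:
        for bits in (16, 4):
            size = weight_memory_gb(n_params, bits)
            print(f"{name} at {bits}-bit: ~{size:.1f} GB of weights, "
                  f"{gpus_needed(size)} x 24GB GPU(s)")
```

By this estimate the 65 billion parameter model at 4 bits needs about 32.5GB for weights, hence two 24GB cards, while the 7 billion parameter model at 4 bits needs about 3.5GB.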
I don't get where your question is coming from; you can already run LLMs on a single machine. Check out llama.cpp, tabby, text generation webui, gpt4all, open source AI Dungeon models like clover-edition, and now this WebGPU-based app.
The question comes from a kind of confusion. We know the requirements of LLMs. How can the big LLM run on the hardware I currently have, with only an 11GB graphics card? I really didn't mean it as criticism!