by timestretch 1261 days ago
There is a lot of ongoing research into making language models that can run well on a wider variety of hardware. It seems VRAM is the main limitation at this point.

You can already run smaller language models on your own hardware if you have a GPU with sufficient VRAM. For example, with quantization, you can run gpt-neox-20b (512 token context window) or gpt-pythia-13b (full context window) on an RTX 3090 with 24GB VRAM. Quantization stores each parameter in 8 or 4 bits instead of 16 or 32, so the same model fits in a half to an eighth of the memory, usually at some cost in output quality.
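To make the arithmetic concrete, here is a rough sketch of how much memory the weights alone take at each bit width. The function name and the round parameter counts are my own assumptions for illustration, and it ignores activations, the attention cache, and framework overhead, which is presumably why the 20B model only fits with a short context.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Ignores activations, attention cache, and framework overhead.
    """
    return n_params * bits_per_param / 8 / 1e9


VRAM_GB = 24  # RTX 3090

# Round parameter counts, assumed for illustration only.
for name, n_params in [("20B", 20e9), ("13B", 13e9), ("7B", 7e9)]:
    for bits in (32, 16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        verdict = "fits" if gb < VRAM_GB else "does not fit"
        print(f"{name} @ {bits:>2}-bit: {gb:5.1f} GB of weights ({verdict} in {VRAM_GB} GB)")
```

By this estimate a 20B model needs about 40GB at 16 bits but only about 20GB at 8 bits, which leaves just a few GB of a 24GB card for everything else.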

Another possibility is to use reinforcement learning from human feedback (RLHF) to tune smaller models to give results comparable to larger models.

I've also been using RWKV with good results. It is a language model that uses an RNN and only needs matrix-vector multiplication instead of matrix-matrix, so inference runs much faster. The 7B model uses about 14GB VRAM without quantization (roughly 2 bytes per parameter at 16-bit precision). A 14B model is currently in training, but progress checkpoints are available. You can also do inference on a CPU, although it is much slower than on a GPU.
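For anyone wondering what "only matrix-vector" looks like, here is a toy recurrent step in plain Python. To be clear, this is a generic textbook RNN cell, not RWKV's actual formulation, and all the names and weights are made up. The point is only that each new token costs a fixed number of matrix-vector products and the state stays the same size no matter how many tokens came before.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(W_h, W_x, h, x):
    """One generic RNN step: h_new = tanh(W_h @ h + W_x @ x).

    Illustration only; RWKV's real update rule is different.
    """
    a = matvec(W_h, h)
    b = matvec(W_x, x)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


# Tiny made-up weights and inputs.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]

h = [0.0, 0.0]  # fixed-size state carried from token to token
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = rnn_step(W_h, W_x, h, x)  # two matrix-vector products per token
    print(h)
```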