by muyuu 1177 days ago
"consumer hardware" is a bit vague as a limitation, which I guess is partly why people aren't tracking very closely what runs on what

these could be useful:

https://nixified.ai

https://github.com/Crataco/ai-guide/blob/main/guide/models.m... -> https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...

https://github.com/cocktailpeanut/dalai

the 4-bit quantized version of LLaMA 13B runs on my laptop without a dedicated GPU (converted as in this link, but for 13B instead of 7B: https://github.com/ggerganov/llama.cpp#usage ). I guess the same would apply to quantized Vicuna 13B, but I haven't tried that yet

GPT4All LoRA also works, and gives perhaps the most compelling results I've got yet on my local computer. I still have to try quantized Vicuna to see how that one goes, but processing the files to get a 4-bit quantized version will take many hours, so I'm a bit hesitant

PS: converting 13B Llama took my laptop's i7 around 20 hours and required a large swap file on top of its 16GB of RAM
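
rough arithmetic for context, as a sketch rather than a measurement: assuming a nominal 13 billion parameters and 16-bit weights before quantization, the unquantized weights alone are larger than 16GB of RAM, which may be part of why the conversion leaned on swap, while the 4-bit result fits. The function name and the parameter count below are illustrative assumptions, and the estimate ignores the per-block scale factors that quantized formats store as well as anything else the conversion script keeps in memory.

```python
def approx_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter count for a "13B" model (an assumption for illustration).
N_PARAMS_13B = 13e9

fp16_gb = approx_weights_gb(N_PARAMS_13B, 16)  # 16-bit weights before quantization
q4_gb = approx_weights_gb(N_PARAMS_13B, 4)     # 4-bit weights, ignoring scale overhead

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```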

feel free to answer back if you're trying any of these things this week (later I might lose track)


Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM. Is that what you meant by a large swap file?

On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?
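
To make the chunk-by-chunk idea concrete, here is a minimal sketch that assumes the base weights and the delta are two raw float32 files of equal length. That is a simplification: real checkpoints are stored in container formats that would need parsing first, and this is not how the Vicuna scripts work. The name `apply_delta_chunked` and the chunk size are hypothetical.

```python
from array import array


def apply_delta_chunked(base_path, delta_path, out_path, chunk_values=1 << 20):
    """Write base + delta to out_path, holding one chunk of each file in memory."""
    itemsize = array("f").itemsize  # bytes per float32 value
    with open(base_path, "rb") as fb, open(delta_path, "rb") as fd, \
            open(out_path, "wb") as fo:
        while True:
            b = fb.read(chunk_values * itemsize)
            d = fd.read(chunk_values * itemsize)
            if len(b) != len(d):
                raise ValueError("base and delta files differ in length")
            if not b:
                break  # both files are exhausted
            base = array("f")
            base.frombytes(b)
            delta = array("f")
            delta.frombytes(d)
            # Element-wise sum of this chunk only; earlier chunks are already on disk.
            merged = array("f", (x + y for x, y in zip(base, delta)))
            merged.tofile(fo)
```

Peak memory here depends on `chunk_values` rather than on the size of the model, which is the point of the question.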

Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):

>To facilitate training very large language models that does not fit into the main memory of a single machine, EasyLM adopt a streaming format of model checkpoint. The streaming checkpointing format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flatten a nested state dictionary into a single level dictionary, and stream the key, value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators to the main memory at a time, hence saving a lot of memory.

https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...

https://github.com/young-geng/EasyLM/blob/main/docs/koala.md
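
A rough sketch of the flatten-and-stream idea described in that quote, not EasyLM's actual code: it flattens a nested dictionary into (key path, value) records and writes them to a file one at a time, then reads them back one at a time. It uses `pickle` from the standard library in place of messagepack to stay self-contained, and the names `flatten`, `save_streaming` and `load_streaming` are hypothetical.

```python
import pickle


def flatten(tree, prefix=()):
    """Yield (key_path, leaf) pairs from a nested dict, one leaf at a time."""
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def save_streaming(tree, path):
    """Write each (key_path, leaf) record to the file separately."""
    with open(path, "wb") as f:
        for key_path, leaf in flatten(tree):
            pickle.dump((key_path, leaf), f)


def load_streaming(path):
    """Yield records back one by one, so only one leaf is loaded at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break  # no more records in the file
```

Because both the writer and the reader are generators over single records, a consumer can process one tensor-sized value and discard it before the next is read.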

Presumably the same technique can be used with Vicuna.

btw I got 4-bit quantized Vicuna working on my 16GB laptop and the results seem very good, perhaps the best I've got running locally so far
Did you have to diff LLaMA? Did you use EasyLM?
I found it ready-made for download here: https://huggingface.co/eachadea/ggml-vicuna-13b-4bit