by kordlessagain
1041 days ago
I've been evaluating non-quantized models on Google Cloud instances with various GPUs. To run a `vllm`-backed Llama 2 7B model[1], start a Debian 11 spot instance on a g2-standard-8 with one Nvidia L4 and 100 GB of SSD disk (ignoring the advice to use a CUDA installer image), then run:

  sudo apt-get update -y
  sudo apt-get install build-essential -y
  sudo apt-get install linux-headers-$(uname -r) -y
  wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
  sudo sh cuda_11.8.0_520.61.05_linux.run  # ~5 minutes, install defaults, type 'accept'/return
  sudo apt-get install python3-pip -y
  sudo pip install --upgrade huggingface_hub
  # for Meta model access, paste a token from HF[2]; skip using the token as a git credential
  huggingface-cli login
  sudo pip install vllm  # ~8 minutes

Then paste this test code for the 7B Llama 2 model into llama.py:

  from vllm import LLM
  llm = LLM(model="meta-llama/Llama-2-7b-hf")
  output = llm.generate("The capital of Brazil is called")
  print(output)

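
As written, print(output) dumps the whole result object rather than just the generated text. A minimal sketch of pulling the text out, assuming generate() returns a list of request results that each carry an `outputs` list whose items have a `.text` field (which is how vLLM's RequestOutput is shaped as far as I know). The helper name `completion_texts` is made up for illustration, and the stand-in objects below only mimic that shape so the snippet runs without a GPU:

```python
from types import SimpleNamespace


def completion_texts(request_outputs):
    """Flatten generate() results into a plain list of generated strings."""
    return [
        completion.text
        for request in request_outputs
        for completion in request.outputs
    ]


# Stand-in data shaped like a generate() result (assumption, not real model output).
fake_results = [
    SimpleNamespace(
        prompt="The capital of Brazil is called",
        outputs=[SimpleNamespace(text=" <generated text>")],
    )
]

print(completion_texts(fake_results))
```

In llama.py this would be used as print(completion_texts(output)) in place of print(output).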
The spot price for this deployment is ~$225/month. Google will eventually terminate the instance, so plan accordingly.

[1] https://vllm.readthedocs.io/en/latest/models/supported_model...
[2] https://huggingface.co/settings/tokens
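
For short experiments the monthly spot figure is easier to reason about per hour. A rough sketch, assuming the ~$225/month quoted above means running around the clock and that a month averages 730 hours (both are my assumptions, and spot prices change over time):

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def hourly_rate(monthly_usd, hours_per_month=HOURS_PER_MONTH):
    """Convert a continuous-use monthly price into an approximate hourly rate."""
    return monthly_usd / hours_per_month


def run_cost(monthly_usd, hours):
    """Approximate cost of keeping the instance up for `hours` hours."""
    return hourly_rate(monthly_usd) * hours


print(f"~${hourly_rate(225):.2f}/hour")
print(f"~${run_cost(225, 8):.2f} for an 8-hour session")
```

This ignores disk and network charges, which are billed separately from the instance itself.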