by kordlessagain 1041 days ago
I've been evaluating running non-quantized models on a Google Cloud instance with various GPUs.

To run a `vllm`-backed Llama 2 7B model[1], start a Debian 11 spot instance: a g2-standard-8 with one Nvidia L4 GPU and 100 GB of SSD disk (ignoring the advice to use a CUDA installer image):

  sudo apt-get update -y
  sudo apt-get install build-essential -y
  sudo apt-get install linux-headers-$(uname -r) -y
  wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
  sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
  sudo apt-get install python3-pip -y
  sudo pip install --upgrade huggingface_hub 
  
  # for Meta model access, paste a token from HF[2] when prompted;
  # skip using the token as a git credential
  huggingface-cli login
  
  sudo pip install vllm # ~8 minutes

Then create the test code for a 7B Llama 2 model (paste into llama.py):

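  from vllm import LLM
  llm = LLM(model="meta-llama/Llama-2-7b-hf")  # downloads the weights on first run
  output = llm.generate("The capital of Brazil is called")
  print(output)  # a list of RequestOutput objects; the text is in output[0].outputs[0].text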
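
If you only want the generated text and a bounded completion length, something like the following should work. This is a minimal sketch, not the only way to do it: `completion_texts` and `main` are names I made up, and the temperature and `max_tokens` values are arbitrary choices. `SamplingParams` and the `.outputs[...].text` fields come from vllm itself.

```python
def completion_texts(request_outputs):
    """Collect the text of every completion from a list of vllm RequestOutput objects."""
    return [
        completion.text
        for request in request_outputs
        for completion in request.outputs
    ]


def main():
    # Imported here so the helper above can be used on a machine without vllm or a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    # Assumed settings: greedy decoding, at most 32 new tokens.
    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(["The capital of Brazil is called"], params)
    for text in completion_texts(outputs):
        print(text)
```

On the instance, add a `main()` call at the bottom of the file and run it with `python3 llama.py`.
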
Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.
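
For comparing against hourly GPU pricing elsewhere, the monthly figure can be converted roughly as below. This is back-of-the-envelope arithmetic only: the 730 hours per month is my assumption (365 days * 24 h / 12), and actual spot prices vary by zone and over time.

```python
HOURS_PER_MONTH = 730  # assumption: 365 days * 24 hours / 12 months


def hourly_rate(monthly_usd, hours_per_month=HOURS_PER_MONTH):
    """Convert a monthly price into an approximate hourly price."""
    return monthly_usd / hours_per_month


def cost_for_hours(monthly_usd, hours):
    """Approximate cost of running the instance for a given number of hours."""
    return hourly_rate(monthly_usd) * hours


print(f"~${hourly_rate(225):.2f}/hour")
print(f"~${cost_for_hours(225, 24):.2f} per 24 hours")
```
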

[1] https://vllm.readthedocs.io/en/latest/models/supported_model...

[2] https://huggingface.co/settings/tokens

1 comments

This looks promising, after looking at Azure/AWS/GC/Linode GPU instances all day. When you say "eventually terminated", what magnitude of time are you referring to? Hours? Days? Weeks? Months? Years?
You can set a time limit yourself, or otherwise expect the spot instance to be terminated after 24 hours. That said, Google will terminate instances as needed in the zone you deploy in, so your mileage will vary.
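
Since termination can come at any time, one way to plan for it is to poll the Compute Engine metadata server, which exposes a `preempted` value for the instance. A rough sketch, using only the standard library: the function names are mine, and what you do once it reports true (save state, stop accepting requests) is up to you.

```python
import urllib.request

# Compute Engine metadata endpoint; it returns "TRUE" once the instance is preempted.
PREEMPTED_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def fetch_preempted_flag(url=PREEMPTED_URL, timeout=2.0):
    """Read the raw preempted flag from the metadata server."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def is_preempted(fetch=fetch_preempted_flag):
    """Return True if the instance reports preemption, False otherwise."""
    try:
        return fetch() == "TRUE"
    except OSError:
        # Not running on GCE, or the metadata server is unreachable.
        return False
```

Calling `is_preempted()` every few seconds from a background loop gives you a chance to shut down cleanly before the instance goes away.
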