by kordlessagain 1089 days ago

I've been evaluating running non-quantized models on a Google Cloud instance with various GPUs.

To run a `vllm`-backed Llama 2 7b model[1], start a Debian 11 spot instance with one Nvidia L4 on a g2-standard-8 machine with 100GB of SSD disk (ignoring the advice to use a CUDA installer image), then run:

  sudo apt-get update -y
  sudo apt-get install build-essential -y
  sudo apt-get install linux-headers-$(uname -r) -y
  wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
  sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
  sudo apt-get install python3-pip -y
  sudo pip install --upgrade huggingface_hub 
  
  # for Meta model access, paste a token from HF[2]; skip using the token as a git credential
  huggingface-cli login
  
  sudo pip install vllm # ~8 minutes
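
Before moving on, it can help to confirm the tools from the steps above actually landed somewhere findable. A minimal sketch, assuming the CUDA runfile used its default prefix (`/usr/local/cuda-11.8`); the `find_tools` helper is hypothetical, not part of any of the packages installed above:

```python
import os
import shutil

# Assumption: the CUDA 11.8 runfile installed the toolkit under its default prefix.
CUDA_BIN = "/usr/local/cuda-11.8/bin"


def find_tools(names=("nvidia-smi", "nvcc", "huggingface-cli"), extra_dir=CUDA_BIN):
    """Return {tool: path or None}, searching PATH plus the assumed CUDA bin dir."""
    search_path = os.environ.get("PATH", "") + os.pathsep + extra_dir
    return {name: shutil.which(name, path=search_path) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If `nvcc` only turns up under the CUDA directory, that directory is probably not on your PATH yet.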

Then create the test code for a 7b Llama 2 model (paste into llama.py and run it with `python3 llama.py`):

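  from vllm import LLM

  # downloads the weights from Hugging Face on first run
  llm = LLM(model="meta-llama/Llama-2-7b-hf")
  outputs = llm.generate("The capital of Brazil is called")
  # generate() returns one RequestOutput per prompt; print the generated text
  for output in outputs:
      print(output.outputs[0].text)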
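
As a rough check on why a non-quantized 7b model fits on a single L4, here is a back-of-the-envelope estimate of the weight memory. It assumes about 6.7e9 parameters stored at 2 bytes each (fp16) and 24 GB on the L4; the `weight_memory_gib` helper is hypothetical, and the estimate ignores the KV cache and activations, which take up part of what is left:

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumptions: ~6.7e9 parameters for Llama 2 7b, 2 bytes per parameter (fp16).
weights = weight_memory_gib(6.7e9)
print(f"weights: ~{weights:.1f} GiB of the L4's 24 GB")
```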

Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.
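
To turn the monthly figure into something comparable with per-hour pricing, a small sketch. The 730 hours/month average and both helper names are assumptions for illustration, and actual spot prices move over time:

```python
HOURS_PER_MONTH = 730  # assumption: average month (8760 hours / 12)


def hourly_rate(monthly_usd: float) -> float:
    """Convert a monthly price into an approximate hourly rate."""
    return monthly_usd / HOURS_PER_MONTH


def cost_for(hours: float, monthly_usd: float = 225.0) -> float:
    """Approximate cost of running the instance for a given number of hours."""
    return hours * hourly_rate(monthly_usd)


print(f"~${hourly_rate(225):.2f}/hour")
print(f"8 h/day for 20 days: ~${cost_for(8 * 20):.2f}")
```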

[1] https://vllm.readthedocs.io/en/latest/models/supported_model...

[2] https://huggingface.co/settings/tokens

1 comments

drusepth 1088 days ago

This looks promising, after looking at Azure/AWS/GC/Linode GPU instances all day. When you say "eventually terminated", what magnitude of time are you referring to? Hours? Days? Weeks? Months? Years?

kordlessagain 1086 days ago

You can set a termination time on the instance yourself, or expect the spot instance to be terminated after 24 hours. That said, Google will terminate instances as needed for the zone you deploy in, so your mileage will vary.
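
If you want a long-running job to notice the termination, one option is to poll the instance metadata server for its `preempted` value. A minimal sketch: the `fetch_metadata` and `is_preempted` names are hypothetical, and the endpoint is only reachable from inside the instance, so the check falls back to False anywhere else:

```python
import urllib.request

# GCE metadata endpoint; returns the text TRUE once the instance has been preempted.
PREEMPTED_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def fetch_metadata(url: str = PREEMPTED_URL, timeout: float = 2.0) -> str:
    """Query the metadata server (only reachable from inside the instance)."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def is_preempted(fetch=fetch_metadata) -> bool:
    """True if the metadata server reports preemption; False if it cannot be reached."""
    try:
        return fetch().upper() == "TRUE"
    except OSError:
        return False
```

Calling `is_preempted()` between batches gives you a chance to write out results before the machine goes away.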
