by ankitmathur
1157 days ago
We'd love to help you all deploy this!
1. We just released a couple of models that are much smaller (https://huggingface.co/databricks/dolly-v2-6-9b), and these should be much easier to run on commodity hardware in a reasonable amount of time.
2. Regarding this particular issue, I suspect something is wrong with the setup. The example is generating a little over 100 words, which is probably something like 250 tokens. 12 minutes makes no sense for that if you're running on a modern GPU. I'd love to see details on which GPU was selected. I'm unfamiliar with any modern GPU that has 30GB of memory (A10 is 24GB, T4 is 16GB, and A100 is 40/80GB). Are you sure you're using a version of PyTorch that installs CUDA correctly?
3. We have seen single-GPU inference work in 8-bit on the A10, so I'd suggest that as a follow-up.
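One quick way to check point 2 is to ask PyTorch directly what it can see. This is only a minimal sketch: the helper name `cuda_report` is made up for illustration, and it assumes nothing beyond `torch` possibly being installed.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup (hypothetical helper)."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # None here means a CPU-only PyTorch build, regardless of the installed driver.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_name"] = props.name
        report["gpu_memory_gib"] = round(props.total_memory / 2**30, 1)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, generation is most likely falling back to the CPU, which could explain multi-minute runtimes.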
    import torch
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

    generate_text = pipeline(model="databricks/dolly-v2-6-9b", torch_dtype=torch.bfloat16, trust_remote_code=True, device=0)
    generate_text("Explain to me the difference between nuclear fission and fusion.")
This causes the kernel to crash, even though the GPU should have plenty of memory:
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P6000        Off  | 00000000:00:05.0 Off |                  Off |
    | 26%   45C    P8    10W / 250W |   6589MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-----------------------------------------------------------------------------+
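As a rough back-of-the-envelope check of the memory claim, the sketch below compares weight storage against the free memory reported above. It assumes the model has about 6.9 billion parameters (inferred only from the name `dolly-v2-6-9b`) and counts weights alone, ignoring activations, the KV cache, and CUDA overhead, so the real footprint will be higher.

```python
MIB = 2**20  # bytes per MiB


def weight_mib(n_params, bytes_per_param):
    """Approximate memory for model weights only, in MiB."""
    return n_params * bytes_per_param / MIB


# Assumption: parameter count guessed from the model name, not from a model card.
n_params = 6.9e9

# Figures taken from the nvidia-smi output above.
total_mib = 24576
used_mib = 6589
free_mib = total_mib - used_mib

bf16_mib = weight_mib(n_params, 2)  # bfloat16 stores 2 bytes per parameter
int8_mib = weight_mib(n_params, 1)  # 8-bit stores 1 byte per parameter

print(f"free GPU memory : {free_mib} MiB")
print(f"bf16 weights    : {bf16_mib:.0f} MiB (fits: {bf16_mib < free_mib})")
print(f"8-bit weights   : {int8_mib:.0f} MiB (fits: {int8_mib < free_mib})")
```

Under these assumptions the weights fit in the free memory, so an out-of-memory condition from weights alone seems unlikely.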
I'm extremely excited to try these models but they are by far the most difficult experience I've ever had trying to do basic inference.