| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by fazkan 622 days ago
	we have vllm in certin production instances, it is a pain for most non-nvidia related architectures. A bit of digging around and we realized that most of it is just a wrapper on top of pytorch function calls. If we can do away with batch processing with vllm supports, we can be good, this is what we did here.

1 comments

dhruvdh 622 days ago

Batching is how you get ~350 tokens/sec on Qwen 14b on vLLM (7900XTX). By running 15 requests at once.

Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?

link

fazkan 622 days ago

driver mismatch issues, we mostly use publicly available instances, so the drivers change as the instances change, according to their base image. Not saying it won't work, but it was more painful to figure out vllm, than to write a simple inference script and do it ourselves.

link