by cthalupa 17 days ago
You can split tensors across an AMD GPU and an Nvidia GPU; different architectures are not an issue. People run LLMs across some pretty crazy setups.

It depends, but you cannot directly mix, for example, Ampere with Ada, because Ampere lacks native FP8 support.
A variety of inference engines support this regardless of whether Ampere has native FP8: llama.cpp will do it quite happily, and vLLM can do W8A16 quantization too.
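
For anyone unfamiliar with the term, W8A16 means the weights are stored as 8-bit integers while the activations stay at 16-bit, so no FP8 hardware is involved. A rough sketch of the idea in plain Python follows. Python floats stand in for FP16 here, and the function names are made up for illustration; this is not how llama.cpp or vLLM implement it internally.

```python
def quantize_row(weights):
    """Symmetric int8 quantization of one weight row; returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights)
    # One scale per row; fall back to 1.0 for an all-zero row.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dot_w8a16(q, scale, activations):
    """Dequantize the int8 weights on the fly and accumulate in floating point."""
    return sum((qi * scale) * a for qi, a in zip(q, activations))


# Hypothetical example values, chosen only to show the round trip.
weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 2.0, -1.0, 0.5]

q, scale = quantize_row(weights)
approx = dot_w8a16(q, scale, activations)
exact = sum(w * a for w, a in zip(weights, activations))
```

The point is that the 8-bit part only affects storage: the arithmetic happens in whatever precision the GPU already supports, which is why the same quantized weights can be used on cards with and without native FP8.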

There are a whole lot of ways to quantize models in general.

Yeah, you'd need to use asymmetric quantization and other software techniques.
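
As a minimal sketch of what asymmetric quantization means (assuming the usual scale plus zero-point scheme mapped onto unsigned 8-bit integers; the function names and sample values are hypothetical, not taken from any particular engine):

```python
def quantize_asymmetric(values, num_bits=8):
    """Map floats onto [0, 2**num_bits - 1] using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    # The zero point is the integer that represents the real value 0.0.
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]


# Hypothetical skewed data: mostly positive, with a small negative tail.
values = [-0.2, 0.0, 0.4, 1.3, 3.0]
q, scale, zero_point = quantize_asymmetric(values)
recovered = dequantize(q, scale, zero_point)
```

Unlike a symmetric scheme, the zero point lets the integer range follow a lopsided distribution instead of wasting half of it on values that never occur.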