


	by cthalupa 17 days ago
	You can split tensors across an AMD GPU and Nvidia GPU - different architectures are not an issue. People run LLMs across some pretty crazy setups.


binyu 17 days ago

It depends, but you cannot directly mix, for example, Ampere with Ada, because Ampere lacks native FP8 support.


cthalupa 16 days ago

There are a variety of inference engines that support this, regardless of whether or not Ampere has native FP8. llama.cpp will do it quite happily, and with vLLM you can use W8A16 quantization too.

There are a whole lot of ways to quantize models in general.
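
To make the W8A16 idea concrete (8-bit weights, 16-bit activations), here is a rough sketch in plain Python. It is not how vLLM or llama.cpp implement it: the helper names `quantize_row` and `matvec_w8` are made up for illustration, real engines use fused GPU kernels, and the activations here are ordinary Python floats standing in for FP16. The point is only that the weights are stored as integers and scaled back up at compute time, so no FP8 hardware is needed.

```python
# Weight-only quantization sketch (W8A16-style), illustrative only.
# quantize_row and matvec_w8 are hypothetical helpers, not a library API.

def quantize_row(row):
    """Symmetric per-row int8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def matvec_w8(qrows, scales, x):
    """Dequantize each int8 row on the fly and multiply by activations x."""
    out = []
    for q, s in zip(qrows, scales):
        out.append(sum(qi * s * xi for qi, xi in zip(q, x)))
    return out

weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]  # activations stay in floating point

quantized = [quantize_row(r) for r in weights]
qrows = [q for q, _ in quantized]
scales = [s for _, s in quantized]

approx = matvec_w8(qrows, scales, x)
exact = [sum(w * xi for w, xi in zip(r, x)) for r in weights]
```

Comparing `approx` with `exact` shows the small rounding error you trade for roughly half the weight memory of FP16.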


binyu 16 days ago

Yeah, you'd need to use asymmetric quantization and other software techniques.
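
For anyone unfamiliar with asymmetric quantization, a minimal sketch follows, assuming a single scale and zero point per tensor and unsigned 8-bit storage. The function names `quantize_asymmetric` and `dequantize` are invented for this example; real libraries typically work per channel or per group and do this on the GPU.

```python
# Asymmetric (affine) quantization sketch, illustrative only.
# Function names are hypothetical, not taken from any library.

def quantize_asymmetric(values, num_bits=8):
    """Map floats onto [0, 2**num_bits - 1] using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

values = [-0.2, 0.0, 0.4, 1.0]
q, scale, zp = quantize_asymmetric(values)
restored = dequantize(q, scale, zp)
```

Unlike the symmetric case, the zero point shifts the integer grid, so a lopsided range such as -0.2 to 1.0 uses all 256 levels instead of wasting most of the negative half.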
