| HN Mirror



	by binyu 19 days ago
	The V100 and the 4090 are based on vastly different architectures: the former uses the older Volta, while the latter uses Ada. Last I checked, you cannot meaningfully combine them. The 3090 is better than the V100, so just get two 3090s and an NVLink bridge.

2 comments

tymscar 19 days ago

Well, I did in fact meaningfully combine them without an issue; that was the whole point of the blog post.


binyu 18 days ago

Yes, but it creates a bottleneck that negates the benefit of using multiple cards that way. Look into it. Cheers.


tymscar 18 days ago

Well, it doesn’t matter, because the bottleneck here is actually quite small for me. The issue is VRAM. If anything, the bottleneck is my 4080.


binyu 18 days ago

Gotcha, I am not saying your setup is inherently wrong or useless. I am glad it works for your use cases. Godspeed


tymscar 18 days ago

I think it's a very fair thing you have flagged!


cthalupa 19 days ago

You can split tensors across an AMD GPU and Nvidia GPU - different architectures are not an issue. People run LLMs across some pretty crazy setups.
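To make the idea concrete, here is a minimal sketch of one way a model's layers might be divided across mismatched cards: in proportion to each card's VRAM. `split_layers` is a hypothetical helper written for illustration, not part of any inference engine, and the VRAM figures are made-up examples rather than anyone's actual setup. As far as I know, llama.cpp exposes a similar proportional idea through its `--tensor-split` option.

```python
def split_layers(n_layers, vram_gb):
    """Assign whole layers to devices in proportion to their VRAM.

    Hypothetical helper for illustration only. Uses largest-remainder
    rounding so the per-device counts always add up to n_layers.
    """
    total = sum(vram_gb)
    exact = [n_layers * v / total for v in vram_gb]
    counts = [int(x) for x in exact]  # round every share down first

    # Hand out the layers lost to rounding, largest fractional part first.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(vram_gb)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Assumed example: an 80-layer model on a 24 GB card plus a 16 GB card.
print(split_layers(80, [24, 16]))  # [48, 32]
```

Nothing in this arithmetic cares which vendor or architecture each device is; that only matters once the engine has to run kernels on it.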


binyu 18 days ago

It depends, but you cannot directly mix, for example, Ampere with Ada, because Ampere lacks native FP8 support.


cthalupa 18 days ago

There are a variety of inference engines that support this, regardless of whether or not there is native FP8 in Ampere - llama.cpp will do it quite happily. With vLLM you can do W8A16 quantization too.

There are a whole lot of ways to quantize models in general.
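As a rough illustration of why W8A16 does not need FP8 hardware: the weights are stored as 8-bit integers with a scale, and are expanded back to floating point when multiplied by the higher-precision activations. The sketch below is plain Python under stated assumptions, not vLLM's or llama.cpp's implementation: the function names are hypothetical, the scheme shown is symmetric per-row int8, and ordinary Python floats stand in for 16-bit activations.

```python
def quantize_weights_int8(rows):
    """Symmetric per-row int8 quantization.

    Returns (quantized rows, per-row scales). Illustrative only.
    """
    q_rows, scales = [], []
    for row in rows:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the int8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def matvec_w8a16(q_rows, scales, x):
    """Multiply int8 weights by float activations, rescaling each output row."""
    return [scale * sum(q * xi for q, xi in zip(q_row, x))
            for q_row, scale in zip(q_rows, scales)]


weights = [[0.5, -1.0], [0.25, 0.125]]
activations = [1.0, 2.0]  # stand-in for fp16 activations
q_rows, scales = quantize_weights_int8(weights)
print(matvec_w8a16(q_rows, scales, activations))  # close to [-1.5, 0.5]
```

The only arithmetic involved is integer storage plus ordinary floating-point math, which every one of the architectures mentioned here can do.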


binyu 17 days ago

Yeah, you'd need to use asymmetric quantization and other software techniques.
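For anyone unfamiliar with the term, asymmetric quantization maps a float range [lo, hi] onto unsigned integers using a scale and a zero point, so the range does not have to be centered on zero. A minimal sketch, assuming 8-bit unsigned codes and a single scale for the whole tensor; the function names are hypothetical and this is not taken from any particular library:

```python
def quantize_asymmetric(values, bits=8):
    """Asymmetric (affine) quantization to unsigned integers.

    Returns (codes, scale, zero_point). Illustrative sketch only.
    """
    qmin, qmax = 0, 2 ** bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point))
             for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [scale * (c - zero_point) for c in codes]


values = [-0.2, 0.0, 0.5, 1.3]
codes, scale, zero_point = quantize_asymmetric(values)
print(codes, scale, zero_point)
print(dequantize(codes, scale, zero_point))  # approximately the inputs
```

The round trip is lossy by at most about one quantization step per value, which is the trade being made for the smaller memory footprint.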
