by binyu 19 days ago
The V100 and the 4090 are based on vastly different architectures: the former uses the older Volta, while the latter uses Ada. Last I checked, you cannot meaningfully combine them. The 3090 is better than the V100; just get two 3090s and an NVLink bridge.
2 comments

Well, I did in fact meaningfully combine them without an issue; that was the whole point of the blog post.
Yes but it creates a bottleneck that negates the benefit of using multiple cards that way. Look into it. Cheers
Well, it doesn’t matter, because the bottleneck here is actually quite small for me. The issue is VRAM. If anything, the bottleneck is my 4080.
Gotcha, I am not saying your setup is inherently wrong or useless. I am glad it works for your use cases. Godspeed
I think it's a very fair thing you have flagged!
You can split tensors across an AMD GPU and Nvidia GPU - different architectures are not an issue. People run LLMs across some pretty crazy setups.
It depends, but you cannot directly mix, for example, Ampere with Ada, because Ampere lacks native FP8 support.
There are a variety of inference engines that support this, regardless of whether or not there is native FP8 in Ampere - llama.cpp will do it quite happily. With vLLM you can do W8A16 quantization too.

There are a whole lot of ways to quantize models in general.
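
To make the W8A16 idea concrete (8-bit weights, 16-bit activations), here is a minimal pure-Python sketch. It is not how vLLM or llama.cpp implement it; the function names `quantize_row_int8` and `w8a16_matvec` are made up for illustration, and ordinary Python floats stand in for FP16 activations. The point is only that the weights are stored as integers plus a per-row scale, so no native FP8 hardware is needed.

```python
def quantize_row_int8(row):
    """Symmetric per-row quantization: returns (int8-range values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def w8a16_matvec(q_rows, scales, x):
    """Matrix-vector product with on-the-fly dequantization.

    Weights stay as integers; activations x stay in floating point.
    y_i = scale_i * sum_j(q_ij * x_j)
    """
    return [s * sum(qv * xv for qv, xv in zip(q, x))
            for q, s in zip(q_rows, scales)]


# Hypothetical toy weights and activations, chosen only for the example.
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]

packed = [quantize_row_int8(r) for r in weights]
q_rows = [q for q, _ in packed]
scales = [s for _, s in packed]

y_quant = w8a16_matvec(q_rows, scales, x)
y_exact = [sum(w * xv for w, xv in zip(r, x)) for r in weights]
```

Comparing `y_quant` with `y_exact` shows the rounding error introduced by storing the weights in 8 bits; it stays small here because each row gets its own scale.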

Yeah, you'd need to use asymmetric quantization and other software techniques.
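
For anyone unfamiliar with the term, a rough sketch of asymmetric quantization: instead of centering the integer range on zero, you map the observed [min, max] of the values onto the unsigned integer range using a scale and a zero point. The helper names below (`quantize_asymmetric`, `dequantize`) and the sample values are hypothetical, and real inference engines do this per block or per channel with packed storage.

```python
def quantize_asymmetric(values, bits=8):
    """Map floats onto [0, 2**bits - 1] using a scale and a zero point."""
    qmin, qmax = 0, (1 << bits) - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical, deliberately lopsided values (mostly positive).
values = [-0.2, 0.0, 0.7, 1.5, 3.1]
q, scale, zero_point = quantize_asymmetric(values)
restored = dequantize(q, scale, zero_point)
```

With a lopsided distribution like this one, the zero point lets nearly the whole 0..255 range be spent on values that actually occur, which a symmetric scheme would waste on the unused negative side.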