by mpreda 542 days ago
> I have 7 NVidia 4090s under my desk

I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.

I bought them "new old stock" for $300 apiece. So that's $1800 for all six.


How does the compute performance compare to 4090s for these workloads?

(I realize it will be significantly lower, I'm just trying to get as much of a comparison as possible.)

The Radeon VII is special compared to most older (and current) affordable GPUs in that it uses HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4) instead of 1:64. So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Every affordable card released afterward by either AMD or Nvidia has had its realistic FP64 throughput crippled to below what an AVX-512 many-core CPU can do.
If we speak about FP64, are your loads more like fluid dynamics than ML training?
The 4090 offers 82.58 teraflops of single-precision performance compared to the Radeon Pro VII's 13.06 teraflops.
On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to a 1:2 vs. 1:64 FP64:FP32 ratio).
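
As a rough check on those ratios, here is a minimal sketch that derives peak FP64 throughput from the FP32 figures quoted above. The function name is made up for illustration, and the numbers are simply the ones stated in this thread, not measurements.

```python
# FP32 peak (TFLOPS) and FP64:FP32 ratio, both as quoted in this thread.
CARDS = {
    "Radeon Pro VII": (13.06, 1 / 2),
    "RTX 4090": (82.58, 1 / 64),
}


def fp64_tflops(fp32_tflops, ratio):
    """Peak FP64 throughput implied by an FP32 peak and an FP64:FP32 ratio."""
    return fp32_tflops * ratio


for name, (fp32, ratio) in CARDS.items():
    print(f"{name}: {fp64_tflops(fp32, ratio):.2f} TFLOPS FP64")
```

This gives about 6.53 TFLOPS for the Radeon Pro VII and about 1.29 TFLOPS for the RTX 4090, so roughly a 5x gap in peak FP64. These are theoretical peaks only.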

Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will have about the same speed, regardless of what kind of computations are performed. It is said that ML/AI inference is frequently limited by memory bandwidth.
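
The bandwidth argument can be sketched with a simple roofline-style estimate: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. This is only an illustration. It assumes roughly 1 TB/s for both cards (the thread says "~1TB/s" and "about the same"), and the function name and arithmetic-intensity values are hypothetical.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline-style bound: min(compute peak, bandwidth * arithmetic intensity).

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up.
    """
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


# FP32 peaks as quoted in this thread; ~1 TB/s bandwidth is an assumption.
CARDS = {"Radeon Pro VII": 13.06, "RTX 4090": 82.58}
BANDWIDTH_TB_S = 1.0

# Hypothetical arithmetic intensities (FLOP per byte moved from memory).
for intensity in (2.0, 50.0):
    for name, peak in CARDS.items():
        bound = attainable_tflops(peak, BANDWIDTH_TB_S, intensity)
        print(f"{intensity:5.1f} FLOP/byte  {name}: {bound:.2f} TFLOPS")
```

At a low intensity such as 2 FLOP/byte both cards are capped at the same 2 TFLOPS by memory. The 4090's higher compute peak only shows up once the intensity is high enough that the Radeon Pro VII hits its compute ceiling.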

Double precision is not used in either inference or training as far as I know.
Even the single precision given by the previous poster is seldom used for inference or training.

Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the data with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same. Otherwise people may believe that progress in GPUs over 5 years has been much greater than it really is.

Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.

For people wondering, FP64 throughput:

Titan V: 7.8 TFLOPs

AMD Radeon Pro VII: 6.5 TFLOPs

AMD Radeon VII: 3.52 TFLOPs

4090: 1.3 TFLOPs

For inference, sure; for training, no.
Are you running ML workloads or solving differential equations?

The two are rather different and one market is worth trillions, the other isn't.

I think there is some money to be made in machine learning too.