HN Mirror


by SnowflakeOnIce 706 days ago

> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations

Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.

On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.
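
A minimal sketch of why the time-based number saturates so easily. It assumes the NVML/nvidia-smi definition of GPU utilization: the percent of the sample period during which one or more kernels was executing. The `Kernel` record, its `sms_used` field, and the 108-SM GPU are hypothetical stand-ins. The SM-weighted figure is a simplified cousin of SM efficiency, not what any profiler reports verbatim.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    start_ms: float
    end_ms: float
    sms_used: int  # hypothetical: how many SMs this kernel keeps busy


def gpu_util_percent(kernels: list[Kernel], window_ms: float) -> float:
    """Time-based utilization: % of the window with at least one kernel running."""
    busy = 0.0
    cur_start = cur_end = None
    for k in sorted(kernels, key=lambda k: k.start_ms):
        if cur_end is None or k.start_ms > cur_end:
            # Gap before this kernel: close out the previous busy interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = k.start_ms, k.end_ms
        else:
            # Overlapping kernels count once: utilization ignores how many ran.
            cur_end = max(cur_end, k.end_ms)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 100.0 * busy / window_ms


def sm_weighted_percent(kernels: list[Kernel], window_ms: float, num_sms: int) -> float:
    """Average fraction of SMs kept busy (assumes concurrent kernels never need
    more than num_sms SMs in total)."""
    sm_time = sum((k.end_ms - k.start_ms) * k.sms_used for k in kernels)
    return 100.0 * sm_time / (window_ms * num_sms)


# A copy-style kernel that does no arithmetic and occupies 1 of 108 SMs
# for the whole 1000 ms sample window.
copy_kernel = [Kernel(0.0, 1000.0, sms_used=1)]
print(gpu_util_percent(copy_kernel, 1000.0))                     # 100.0
print(round(sm_weighted_percent(copy_kernel, 1000.0, 108), 2))   # 0.93
```

The time-based metric reports 100% although 107 of 108 SMs sit idle. Even the SM-weighted view cannot tell whether the busy SMs are computing or just waiting on memory.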

5 comments

tanelpoder 705 days ago

This reminds me of the Linux/Unix disk busy "%util" metric in tools like sar and iostat. People sometimes interpret the 100%util as a physical ceiling for the disk IO capacity, just like with CPUs ("we need more disks to get disk I/O utilization down!").

It is a correct metric when your block device has a single physical spinning disk that can only accept one request at a time (dispatch queue depth=1). But the moment you deal with SSDs (capable of highly concurrent NAND IO), SAN storage block devices striped over many physical disks or even a single spinning disk that can internally queue and reorder IOs for more efficient seeking, just hitting 100%util at the host block device level doesn't mean that you've hit some IOPS ceiling.
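
A toy model of this point. It assumes `%util` is roughly the fraction of wall-clock time with at least one request in flight; iostat derives it from the "time spent doing I/Os" counter in /proc/diskstats. The `simulate` helper and its fully parallel device are hypothetical: every outstanding request gets its own internal channel and takes a fixed 100 microseconds.

```python
def busy_time(intervals: list[tuple[int, int]]) -> int:
    """Length of the union of (start, end) intervals: time with >= 1 I/O in flight."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def simulate(queue_depth: int, service_us: int, window_us: int) -> tuple[float, int]:
    """Closed-loop load with queue_depth requests always outstanding.

    Assumption: the device has at least queue_depth internal channels, so
    concurrent requests are served in parallel, each taking service_us."""
    intervals = []
    for _ in range(queue_depth):
        t = 0
        while t + service_us <= window_us:
            intervals.append((t, t + service_us))
            t += service_us
    util_pct = 100 * busy_time(intervals) / window_us
    iops = len(intervals) * 1_000_000 // window_us  # requests per second
    return util_pct, iops


for qd in (1, 8, 32):
    print(qd, simulate(qd, service_us=100, window_us=1_000_000))
```

In this model `%util` reads 100% at every queue depth while IOPS climbs from 10,000 to 320,000, so 100% only means "never idle", not "saturated".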

So it looks like the GPU "SM efficiency" analysis is somewhat like logging in to the storage array itself and checking how busy each physical disk (or at least each disk controller) inside that storage array is.

serial_dev 705 days ago

This sounds like the good old "having high test coverage is bad because I can get to 100% just by calling functions and asserting nothing about them".

100% test coverage doesn't mean your tests are good, but 50% (or pick your number) means they are bad or insufficient.

heavenlyblue 704 days ago

That isn't even necessarily true. For interpreted languages, a test that just runs the code verifies that the code is able to run (i.e., you are not, for example, calling a string object as a function). That isn't always enough to verify functionality, but it's still better than nothing.
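
A sketch of this smoke-test point in Python, with a hypothetical `greet` function that calls a string as if it were a function. A test that merely runs it, with no assertions about the output, still catches the `TypeError`.

```python
def greet(name: str) -> str:
    # Deliberate bug: `name` is a str, so `name()` raises TypeError at runtime.
    return "Hello, " + name()


def greet_fixed(name: str) -> str:
    return "Hello, " + name


def smoke_test(fn, *args) -> bool:
    """Call fn(*args) and report whether it ran without raising; checks no output."""
    try:
        fn(*args)
    except Exception as exc:
        print(f"{fn.__name__} failed: {type(exc).__name__}: {exc}")
        return False
    return True


print(smoke_test(greet, "Ada"))        # prints the TypeError message, then False
print(smoke_test(greet_fixed, "Ada"))  # True
```

The smoke test says nothing about whether `greet_fixed` returns the right greeting, which is why it is "better than nothing" rather than a substitute for assertions.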

HPsquared 705 days ago

In other words it's "necessary, but not sufficient".

roanakb 706 days ago

Yup, similar to SM efficiency in that sense too. If you aren't seeing >80%, there is certainly time left on the table. But a high SM efficiency value doesn't guarantee you're making good use of the hardware either (though it's still a better proxy than GPU util).

shaklee3 705 days ago

This is not true. Lots of algorithms simply can't use 100% of the GPU even when they're written as optimally as possible. FFT is one.
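
One way to make this claim concrete is a back-of-the-envelope roofline estimate. The sketch assumes the conventional 5 * N * log2(N) flop count for a complex FFT, a single pass that reads and writes N complex64 values (8 bytes each), and a hypothetical GPU with 20 TFLOP/s peak and 1 TB/s of memory bandwidth. Real libraries may need several passes for large N, which would lower the intensity further.

```python
import math


def fft_arithmetic_intensity(n: int, bytes_per_element: int = 8) -> float:
    """FLOPs per byte for one out-of-place complex FFT of length n.

    Assumes 5 * n * log2(n) flops and a single pass that reads and writes
    n elements of bytes_per_element each (8 for complex64)."""
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * n * bytes_per_element
    return flops / bytes_moved


def roofline_fraction_of_peak(intensity: float, peak_flops: float, mem_bw_bytes: float) -> float:
    """Attainable fraction of peak compute under the simple roofline model."""
    attainable = min(peak_flops, intensity * mem_bw_bytes)
    return attainable / peak_flops


# Hypothetical GPU: 20 TFLOP/s peak, 1 TB/s memory bandwidth.
ai = fft_arithmetic_intensity(2**20)
print(ai)                                          # 6.25
print(roofline_fraction_of_peak(ai, 20e12, 1e12))  # 0.3125
```

Under these assumptions a 2^20-point FFT is memory-bound and tops out near 31% of peak FLOP/s, even though the memory system may be fully busy.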

defrost 705 days ago

In remote sensing and computational physics applications, it's rare to have a single FFT to compute (whatever algorithm is chosen).

Hence the practice of stuffing many FFTs through GPU grids in parallel, working to max out hardware usage in order to increase application throughput.

eg:

https://arxiv.org/pdf/1707.07263

https://ieeexplore.ieee.org/document/9835388
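
A toy model of why batching helps, with loudly hypothetical numbers: a 108-SM GPU, 4 thread blocks per small FFT, blocks of equal duration, and one block per SM at a time. Under those assumptions blocks run in "waves" of `num_sms`, and a partially filled last wave leaves SMs idle.

```python
import math


def wave_efficiency(num_transforms: int, blocks_per_transform: int, num_sms: int) -> float:
    """Fraction of SM-slots doing work when a batch is launched in one go.

    Toy model (assumption): every block takes the same time, each SM runs
    one block at a time, and blocks are scheduled in full waves."""
    total_blocks = num_transforms * blocks_per_transform
    waves = math.ceil(total_blocks / num_sms)
    return total_blocks / (waves * num_sms)


# Hypothetical: 108 SMs, 4 blocks per small FFT.
for batch in (1, 27, 28, 271):
    print(batch, round(wave_efficiency(batch, 4, 108), 3))
```

In this model a single transform keeps about 4% of the SMs busy, 27 transforms fill exactly one wave, 28 spill into a nearly empty second wave (about 52% overall), and 271 land at roughly 91%. Batching raises utilization, but wave quantization can keep it from landing exactly on 100%.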

shaklee3 705 days ago

I don't mean a single FFT. I mean that FFT algorithms are inherently not going to use the GPU at 100% utilization by any metric.

mpreda 705 days ago

Not so inherently IMO.

What I mean is: where did you take that from? I program FFTs on GPUs, and I see no reason for the "inherently can't reach 100% utilization by any metric".

lights0123 705 days ago

I interpret that comment as meaning you're not going to use every silicon block the GPU provides, such as the video codecs and rasterization hardware. If you've maxed out compute without exceeding the power budget, for example, you could likely still decode video if the GPU has a separate block for it.

defrost 704 days ago

I had a similar read. I packed a lot of parallel FFTs and other processing onto custom TI DSP cards, but the chips in that DSP family were RISC and carried little 'baggage': just fat, fat 32-bit or 64-bit floating-point pipelines with instruction sets optimised for modular ring indexing of scalar and vector operations.

Even then, they ran at 80% "by design" for expected hard real-time usage. They only went to 11 (and dropped results) in smoke tests and with operators who redlined the limits (and got feedback to that effect).
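
The "80% by design" rule can be expressed as a simple utilization budget for periodic work. This sketch uses the standard per-task sum of C/T (worst-case compute time over period). The task names and timings are hypothetical, and 0.8 is just the headroom target from the comment, not a schedulability bound for any particular scheduler.

```python
def utilization(tasks: dict[str, tuple[float, float]]) -> float:
    """Total utilization U = sum(C / T) over periodic tasks given as (C, T) in ms."""
    return sum(c / t for c, t in tasks.values())


def within_budget(tasks: dict[str, tuple[float, float]], budget: float = 0.8) -> bool:
    """True if the planned load leaves the headroom the design asks for."""
    return utilization(tasks) <= budget


# Hypothetical per-frame workload on one DSP with a 10 ms frame period.
plan = {"fft_bank": (3.0, 10.0), "filtering": (4.0, 10.0)}
print(round(utilization(plan), 2), within_budget(plan))  # 0.7 True

plan["detection"] = (1.5, 10.0)
print(round(utilization(plan), 2), within_budget(plan))  # 0.85 False
```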

shaklee3 705 days ago

I'd be curious to see how you'd do it. Try launching an FFT of any size and batch count and see if you can hit 100%.

jorvi 705 days ago

> On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.

Some of us like having more than 2 hours of battery life, and not scalding our skin in the process of using our devices.
