by ghshephard 3753 days ago

What I'm trying to figure out is whether a teraflop is directly comparable. That is, on the TOP500 list, the first 4.9-teraflop computer appeared in 2000, but does that mean that Pascal could provide performance similar to that supercomputer on the LINPACK benchmark?

In describing the benchmark, they say,

In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method" or algorithms which compute a solution in a precision lower than full precision (64 bit floating point arithmetic) and refine the solution using an iterative approach.

So, to summarize: if in 2000 the fastest supercomputer on the planet ran at about 4.9 TFLOPS, does that mean, apples to apples on the LINPACK (and only the LINPACK), that Pascal today would outperform that supercomputer?
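
As a rough sketch of what the quoted operation count implies, the snippet below turns the leading 2/3 n^3 term into a run-time estimate at a sustained 4.9 TFLOPS. The problem size `n`, the helper names, and the assumption that a machine actually sustains its headline rate are all illustrative; the O(n^2) term is dropped.

```python
def lu_flops(n: int) -> float:
    """Leading-order FP64 operation count for LU with partial pivoting (2/3 n^3)."""
    return 2.0 * n**3 / 3.0


def matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Storage for a dense n x n matrix; 8 bytes per element is FP64."""
    return n * n * bytes_per_element


def solve_seconds(n: int, sustained_flops: float) -> float:
    """Time to factor the matrix if the machine sustains the given rate."""
    return lu_flops(n) / sustained_flops


n = 100_000      # assumed problem size, chosen only for illustration
rate = 4.9e12    # 4.9 TFLOPS, the figure from the question

est_seconds = solve_seconds(n, rate)
est_gigabytes = matrix_bytes(n) / 1e9

print(f"n = {n}: about {est_seconds:.0f} s of arithmetic, "
      f"{est_gigabytes:.0f} GB just to hold the matrix")
```

Two machines with the same sustained rate get the same arithmetic time from this estimate, but only if the matrix fits in memory and the data can be delivered fast enough to keep the floating point units busy.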

stuntprogrammer 3753 days ago

Technically, yes, a teraflop is a teraflop, and is directly comparable. It just means you can do an awful lot of floating point operations per second. But many systems are sensitive to memory size, memory bandwidth, and, as a result, communication costs (i.e. latency and bandwidth of the interconnect between machines).

The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.

Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision. They can use FP16, which fits a bigger model in a given memory size, makes better use of memory bandwidth, and, given the right hardware support, reaches extremely high rates, as on Pascal.
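
To make the memory argument concrete, here is a small sketch of how many values fit in a fixed memory budget at each precision, and how many bytes have to move to stream them. The 16 GiB budget is a made-up figure for illustration, not a statement about any particular card, and the helper names are hypothetical.

```python
import struct

# Standard IEEE 754 storage sizes, read from Python's struct format codes.
BYTES_PER_VALUE = {
    "fp16": struct.calcsize("<e"),  # half precision, 2 bytes
    "fp32": struct.calcsize("<f"),  # single precision, 4 bytes
    "fp64": struct.calcsize("<d"),  # double precision, 8 bytes
}


def elements_that_fit(memory_bytes: int, precision: str) -> int:
    """How many values of the given precision fit in the memory budget."""
    return memory_bytes // BYTES_PER_VALUE[precision]


def bytes_moved(num_values: int, precision: str) -> int:
    """Memory traffic needed to read num_values once at the given precision."""
    return num_values * BYTES_PER_VALUE[precision]


budget = 16 * 2**30  # assumed 16 GiB of device memory, illustration only

for precision in ("fp64", "fp32", "fp16"):
    count = elements_that_fit(budget, precision)
    print(f"{precision}: {count:,} values fit in the budget")
```

At the same bandwidth, reading a given number of FP16 values moves a quarter of the bytes that FP64 would, which is where the "better use of memory bandwidth" comes from.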

Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also, the use of NVLink to link multiple GPUs.

dr_zoidberg 3753 days ago

I agree on the memory model being the most interesting thing about this card. I sort of "under-sold" it on the "better design" part of my last bullet.

People and manufacturers tend to look at clock rates, fill rates (for GPUs), FLOPS, and "crunching power" in general, completely forgetting the memory part. For example, most CPUs today end up bound by cache sizes, and performance tuning focuses on being nice to the cache rather than on being optimal in your instructions (see, for example, Abrash's Pixomatic articles [0-2], which are about high performance assembly programming in "modern environments").

With GPUs and "classic" HPC (I don't know about the new systems with the "compute fabric interconnects"), memory usually becomes the bottleneck (except for embarrassingly parallel problems, of course). In fact, I'm pretty sure it was Cray who said that a supercomputer is a way to turn a CPU-bound problem into an IO-bound problem.

[0] http://www.drdobbs.com/architecture-and-design/optimizing-pi...

[1] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...

[2] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...

semi-extrinsic 3753 days ago

This. Anything with global interactions (i.e. low flops per byte transferred from memory to core) is poorly suited for GPUs.

There is a taxonomy of HPC-type workloads called "Colella's seven dwarfs" that helps place different workloads as CPU bound or memory bandwidth bound. See also the "roofline model". Both of these heuristics were made to reason about CPUs, but are also effective for thinking about GPUs.
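
A minimal sketch of the roofline model mentioned above: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity (flops per byte moved). The peak and bandwidth figures are round placeholders, not specifications of any real CPU or GPU, and the function names and the matmul intensity are hypothetical.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity above which a kernel stops being bandwidth bound."""
    return peak_flops / bandwidth_bytes_per_s


PEAK = 5.0e12        # assumed peak, FLOP/s (placeholder)
BANDWIDTH = 5.0e11   # assumed memory bandwidth, bytes/s (placeholder)

# FP64 y = a*x + y: 2 flops per element, 24 bytes of traffic
# (read x, read y, write y), so very low intensity.
axpy_intensity = 2 / 24

# A blocked matrix multiply reuses each loaded value many times;
# 50 flops per byte is an assumed figure for illustration.
matmul_intensity = 50.0

for name, intensity in (("axpy", axpy_intensity), ("matmul", matmul_intensity)):
    bound = attainable_flops(PEAK, BANDWIDTH, intensity)
    print(f"{name}: {intensity:.3f} flops/byte -> at most {bound / 1e9:.1f} GFLOP/s")
```

With these placeholder numbers the low-intensity kernel is capped at under 1% of peak no matter how many flops the chip advertises, which is the "low flops per byte" case described above.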
