| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by trsohmers 3730 days ago
	The 170 TFLOPs number that NVIDIA gives out is for FP16, while the Top 10 list gives its number for for FP64. The P100 that makes up this NVIDIA box gives about 5.3TFLOPs per card, or a total of 42.4TFLOPs for the whole box. Sure, you can say that deep learning doesn't need FP64, but it is REALLY unfair to compare this to anything on the TOP500 list, especially when you consider the fact that this is not balanced in terms of memory size or bandwidth (in relation to the number of FLOPs) when you compare it to any real supercomputer class system.

3 comments

0x07c0 3730 days ago

Was thinking the same, but look at the memory bandwidth of this thing, 720GB/sec.* That is the number you should look at, and that is a sweet number... Also the NVLink tech looks nice for multi-GPU/heterogeneous computing (I guess the latency is the most important, but no idea how that is ). Do's some one know if the Xeons are connected to the GPUs with NVLink or are they on PCIE ? (I know the new POWER chips have NVlink but haven’t read that Intel supports it.)

*http://www.anandtech.com/show/10222/nvidia-announces-tesla-p...

link

trsohmers 3729 days ago

PCIe is the huge bottleneck... The two Xeon's in the box are 2698v3's, which each have only 40 PCIe lanes, meaning they are restricted to using 8x PCIe3 lanes per card, which would net you a whopping 8GB/s between each CPU and GPU. EDIT: Oh, and no, Intel does and probably never will support NVLink. I will eat my words (type this post up on paper and eat it) if they do in the next 5 years.

When I talk about balanced (which is a huge influence in my architectural and system level designs), I want to ideally be able to hit theoretical throughput. If we look at FP64 as an example, if I want to have sustained throughput of fused multiply adds (which is how NVIDIA always advertises their theoretical FLOP numbers as), I would be needing to move 196 data bits (three 64 bit floating point operands) in to each of my FPUs every cycle, and 64 bits out. 256 bits per cycle in a fully pipelined situation to be able to do 2 FLOPs/cycle. So if our ideal bandwidth is 16 Bytes for every 1 FLOP, if you have almost 10x more floating point capability than memory bandwidth, you are going to have a bad time (and GPUs very well reflect this on memory intensive workloads... take a look at GPUs on HPCG, they only get ~1-3% of their theoretical peak).

I'm working on my own HPC targeted chip, so obviously have some bias there, but 720GB/s memory bandwidth for a chip that is that large and using that much power isn't that impressive to me. Obviously I should wait to boast until I have my silicon in hand, but getting more than 3/4ths of that bandwidth in less than 1/10th of the power. Add in some fancy tricks and our goal is having our advertised theoretical numbers be pretty damn close to real application performance for memory intensive workloads.

link

0x07c0 3729 days ago

>PCIe is the huge bottleneck... The two Xeon's in the box..

It's a wast to put Xeon's on this things if they use the PCIe, you end up in a loot of cases only using them to drive the GPU's.

>When I talk about balanced (which is a huge influence in my architectural and system level designs)...

The DP performance on Tesla's is ridiculous, think it is a marketing ploy. People talk of buying gaming cards.., as you are almost always memory bound..

>I'm working on my own HPC targeted chip..

Looks nice, you are throwing out all HW bloat and doing everything in software? Are you planing to have some form of OS running on this chips?

link

jlebar 3730 days ago

Still, how many of these boxes are we talking about to match the performance of the #1 from the top 500 in 2005? 10? 20? That's still under $3m for 20, which is pretty impressive to me.

link

doyoulikeworms 3730 days ago

Out of curiosity, what are some problems/solutions that require FP64?

link

zevets 3730 days ago

Any finite element/volume problem. Anything integrated.

link