by volta83
1655 days ago
I wonder how they manage to keep the FP64 units busy. This seems to be an HPC product, but many HPC apps are memory bound. So to improve FP64 perf by 4x, one might need to improve DRAM bandwidth by 8-16x; otherwise the units would just stall waiting for memory. But it seems they did not improve bandwidth by much?

E.g. multiplying two n×n square matrices has a computational cost of n³ but a bandwidth cost of only n². Usually a big m×m matrix is split into many n×n blocks (with m = k×n). If an n×n block fits into the local store of your CPU (cache or registers), then the bandwidth cost for the m×m matrix product is k³×n×n = m×m×m/n, so the bigger the block size 'n' that you can process inside the CPU, the less bandwidth you need.
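
A rough way to check that m×m×m/n figure is to just count the traffic. This is only a sketch under simplifying assumptions of my own: the function name `blocked_matmul_traffic` is made up, each of the k³ block products reloads one n×n block of A and one of B from DRAM (hence the extra factor of 2), and writes of the result are ignored.

```python
def blocked_matmul_traffic(m: int, n: int) -> dict:
    """Count flops and elements loaded for an m x m product done in n x n blocks.

    Assumed model (not a measurement): every block product loads one block of A
    and one block of B from DRAM; stores of C are ignored.
    """
    if n <= 0 or m % n != 0:
        raise ValueError("m must be a positive multiple of n")
    k = m // n                      # blocks per matrix dimension, m = k * n
    flops = 0
    loads = 0
    for _i in range(k):             # block row of C
        for _j in range(k):         # block column of C
            for _p in range(k):     # blocks along the shared dimension
                loads += 2 * n * n  # one n x n block of A plus one of B
                flops += 2 * n**3   # n^3 multiplies and n^3 adds
    return {"flops": flops, "loads": loads, "intensity": flops / loads}


if __name__ == "__main__":
    m = 1024
    for n in (8, 32, 128):
        t = blocked_matmul_traffic(m, n)
        print(n, t["loads"], t["intensity"])
```

In this model the loads come out to 2×m³/n and the flops per element loaded come out to exactly n, so doubling the block size halves the DRAM traffic for the same amount of arithmetic.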
edit: formatting