HN Mirror


by Veserv 52 days ago

High bandwidth memory (HBM) can deliver TB/s of memory bandwidth and has completely shattered the memory wall for individual cores/compute elements. The only way for compute to keep up is going wide and parallel as seen in GPUs.

Despite this, massively increased memory bandwidth does not translate to material performance improvements on non-parallel compute tasks because few tasks are actually memory bandwidth bound, instead being memory latency bound.

The best known general solutions for improving memory latency are per-compute-element memory caches. Unfortunately, these increase the complexity and size of your compute elements, forcing you to reduce the number of compute elements, but a large number of compute elements is the only way to saturate HBM memory bandwidth.

To keep up, the best known techniques are either to batch algorithmically, which allows you to go wide using vector/batch instructions, or to go the GPU route with memory latency-hiding parallelism.
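To put rough numbers on the latency-bound argument above, here is a minimal sketch using Little's law (sustained bandwidth = bytes in flight / latency). The 64-byte line size, the 100 ns latency, and the figure of 10 outstanding misses per core are illustrative assumptions, not measurements of any particular hardware, and the function names are hypothetical.

```python
# Toy model (Little's law): sustained bandwidth = bytes in flight / latency.
# All constants below are illustrative assumptions, not measurements.

LINE_BYTES = 64       # assumed cache-line size in bytes
LATENCY_NS = 100.0    # assumed memory access latency in nanoseconds


def sustained_gb_per_s(outstanding_requests: int,
                       line_bytes: int = LINE_BYTES,
                       latency_ns: float = LATENCY_NS) -> float:
    """Bandwidth one requester sustains with N independent requests in flight."""
    # Bytes per nanosecond is numerically equal to GB/s (1e9 B per 1e9 ns).
    return outstanding_requests * line_bytes / latency_ns


def requests_needed(target_gb_per_s: float,
                    line_bytes: int = LINE_BYTES,
                    latency_ns: float = LATENCY_NS) -> float:
    """Requests that must be in flight at once to sustain the target bandwidth."""
    return target_gb_per_s * latency_ns / line_bytes


if __name__ == "__main__":
    # Dependent pointer chase: each load must wait for the previous one.
    print(sustained_gb_per_s(1))
    # Batched, independent loads on one core (assumed 10 misses in flight).
    print(sustained_gb_per_s(10))
    # In-flight requests needed to sustain 1 TB/s (1000 GB/s).
    print(requests_needed(1000.0))
```

Under these assumptions a dependent pointer chase sustains about 0.64 GB/s, ten independent misses in flight reach about 6.4 GB/s, and 1 TB/s needs roughly 1,560 requests in flight at once. That gap is what batching and GPU-style latency hiding are meant to close.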


vlovich123 51 days ago

Well… The reason there’s such a big mismatch is the memory controller. Something like 80-90% of the energy is spent moving data in and out because of the complex addressing. If you move compute into the RAM and instead shuttle instructions in and out, you might get a huge speedup. The challenge is when an instruction references data that lives somewhere else, which may end up eliminating all the advantage. But I believe people are trying to commercialize this concept.


zozbot234 51 days ago

> If you move compute into the RAM and instead shuttle instructions in and out, you might get a huge speed up.

Isn't that just a per-compute cache/local memory? You're proposing a scaled-up variety of NUMA where every compute core has its local memory and going outside that will cost you more.
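The NUMA-style trade-off described here can be sketched as a weighted average of local and remote access costs. This is a toy model: the 20 ns local, 200 ns remote, and 100 ns conventional-baseline latencies are assumptions chosen only for illustration, and the function names are hypothetical.

```python
# Toy NUMA-style cost model. All latencies are illustrative assumptions.

def average_access_ns(local_fraction: float,
                      local_ns: float = 20.0,
                      remote_ns: float = 200.0) -> float:
    """Mean access latency when a given fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def break_even_local_fraction(baseline_ns: float = 100.0,
                              local_ns: float = 20.0,
                              remote_ns: float = 200.0) -> float:
    """Local fraction at which the mean latency equals a uniform baseline."""
    return (remote_ns - baseline_ns) / (remote_ns - local_ns)


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5, 0.0):
        print(fraction, average_access_ns(fraction))
    print(break_even_local_fraction())
```

With these assumed numbers, more than about 56% of accesses have to stay local before the design beats the uniform baseline, which is one way to see how references to data elsewhere can erase the advantage.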


vlovich123 51 days ago

Correct, you can think of this like NUMA or a distributed system where you have compute colocated with storage. It’s a special purpose accelerator for very specific problems that have been optimized to take advantage of such an architecture.

It’s also not my proposal. The industry is exploring ways to cut down the energy requirements to do AI - 80-90% of the energy consumption is just moving data back and forth across the memory controller. It has to read a row from a bank into a row buffer, access the specific cell being requested, shuttle it over the bus to the compute, and then write the data back to the cells. The current idea is to maybe do the processing on the entire row buffer, but you could imagine scaling that up to do it at the bank level. The challenges are manufacturing complexity, since DRAM is made differently, heat from the ALU, etc.

[1] https://semiconductor.samsung.com/news-events/tech-blog/hbm-...
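The row-buffer access sequence described above can be sketched with a toy bank model that counts how many bytes cross the bus. This is a simplified illustration and not how any real DRAM or the linked Samsung part works: `ToyBank`, the one-byte cells, the 4-byte result size, and the omission of write-back and timing are all assumptions.

```python
# Toy DRAM bank: counts row activations and bytes sent over the bus.
# Hypothetical model for illustration; write-back and timing are not modeled.

RESULT_BYTES = 4  # assumed size of a reduction result sent to the host


class ToyBank:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]  # each cell holds one byte value
        self.open_row = None                 # index of the row in the row buffer
        self.row_buffer = []
        self.activations = 0
        self.bus_bytes = 0

    def _activate(self, row):
        """Copy a row into the row buffer unless it is already open."""
        if self.open_row != row:
            self.row_buffer = list(self.rows[row])
            self.open_row = row
            self.activations += 1

    def read(self, row, col):
        """Conventional access: one cell crosses the bus per read."""
        self._activate(row)
        self.bus_bytes += 1
        return self.row_buffer[col]

    def sum_row_in_memory(self, row):
        """Assumed in-memory op: reduce the row buffer, ship only the result."""
        self._activate(row)
        self.bus_bytes += RESULT_BYTES
        return sum(self.row_buffer)


if __name__ == "__main__":
    data = [[i % 256 for i in range(1024)]]

    host = ToyBank(data)
    host_total = sum(host.read(0, c) for c in range(1024))
    print(host_total, host.activations, host.bus_bytes)

    pim = ToyBank(data)
    pim_total = pim.sum_row_in_memory(0)
    print(pim_total, pim.activations, pim.bus_bytes)
```

Both paths activate the row once and compute the same sum, but in this model the host-side loop moves 1,024 bytes over the bus while the in-row reduction moves only the 4-byte result.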
