by vlovich123 497 days ago
Unless GPUs work markedly differently somehow or there’s been some fundamental shift in computer architecture I’m not aware of, spatial locality is still a factor in computers.

Aside from HW acceleration today, designs like Cerebras would benefit heavily from reducing the amount of random access to the weights (thus freeing up cross-chip memory bandwidth for other things).

2 comments

This makes me remember game developers back when games could still be played directly from the physical disc. They would often duplicate data to different parts of the disc, knowing that certain data would often be streamed from disc together, so that seek times were minimized.

But those game devs knew where everything was spatially on the disc, and how the data would generally be used during gameplay. It was consistent.

Do engineers have a lot of insight into how models get loaded spatially onto a given GPU at run time? Is this constant? Is it variable on a per GPU basis? I would think it would have to be.

Hard to optimize for this.

This brings to mind The Story of Mel from programming folklore.

http://beza1e1.tuxen.de/lore/story_of_mel.html

Such a good read - some people really are on another level in their chosen field.
Right now models have no structure, so access is random, but you definitely know where the data is located in memory since you put it there. The physical location doesn’t matter, since it all goes through a TLB, but if you ask the GPU for a contiguous memory allocation, it gives it to you. This is probably the absolute easiest thing to optimize for if your data access pattern is amenable to it.
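To make the "you put it there" point concrete, here is a minimal CPU-side sketch in Python, not real GPU code. It assumes the read order is known ahead of time and packs the blocks into one flat buffer in that order, so reads walk the buffer front to back. The names `pack_by_access_order`, `read_block`, and the dict-of-lists layout are made up for illustration; on a GPU the flat buffer would correspond to a single contiguous device allocation.

```python
def pack_by_access_order(blocks, access_order):
    """Copy blocks into one flat list in the order they will be read.

    blocks: dict mapping a block name to a list of floats (hypothetical layout).
    access_order: block names in the order the model is assumed to read them.
    Returns (flat, offsets), where offsets[name] = (start, length) into flat.
    """
    flat = []
    offsets = {}
    for name in access_order:
        if name in offsets:
            # Already packed; this sketch does not duplicate repeated blocks.
            continue
        data = blocks[name]
        offsets[name] = (len(flat), len(data))
        flat.extend(data)
    return flat, offsets


def read_block(flat, offsets, name):
    """Return one block as a slice of the flat buffer."""
    start, length = offsets[name]
    return flat[start:start + length]


if __name__ == "__main__":
    blocks = {"a": [1.0, 2.0], "b": [3.0], "c": [4.0, 5.0]}
    flat, offsets = pack_by_access_order(blocks, ["c", "a", "b"])
    # Reading in the assumed order now touches the buffer sequentially.
    for name in ["c", "a", "b"]:
        print(name, offsets[name], read_block(flat, offsets, name))
```

This only helps if the access order really is predictable, which is the "amenable to it" caveat above.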
That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.
Haven’t read the paper, but my guess is that it’s the same reason as with sparse attention networks (where they zero out many weights): the sparse tensors just end up being larger.
In this paper, we don't zero out the weights. We remove them.
Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?
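For anyone following along, the zero-out versus remove distinction above can be shown with a toy Python sketch. This is not the paper's method: dropping whole rows is an assumption made purely for illustration, and `zero_out`, `remove`, and `stored_params` are hypothetical helper names.

```python
def zero_out(weights, keep_rows):
    """Masking: rows not in keep_rows become 0.0 but are still stored."""
    return [row[:] if i in keep_rows else [0.0] * len(row)
            for i, row in enumerate(weights)]


def remove(weights, keep_rows):
    """Removal: rows not in keep_rows are dropped from storage entirely."""
    return [row[:] for i, row in enumerate(weights) if i in keep_rows]


def stored_params(weights):
    """Count every stored value, including zeros."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    # A 4x3 toy weight matrix; keep rows 0 and 2 (arbitrary choice).
    weights = [[float(r * 3 + c) for c in range(3)] for r in range(4)]
    keep = {0, 2}
    print("original:", stored_params(weights))
    print("zeroed:  ", stored_params(zero_out(weights, keep)))
    print("removed: ", stored_params(remove(weights, keep)))
```

In the masked version the dense tensor keeps its original size, so the stored parameter count does not change; only the removed version actually has fewer parameters.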