| HN Mirror



	by vlovich123 544 days ago
	Unless GPUs work markedly differently somehow or there’s been some fundamental shift in computer architecture I’m not aware of, spatial locality is still a factor in computers. Aside from HW acceleration today, designs like Cerebras would benefit heavily from reducing the amount of random access to the weights (and thus freeing up cross-chip memory bandwidth for other things).

2 comments

whynotminot 544 days ago

This makes me remember game developers back when games could still be played directly from the physical disc. They would often duplicate data to different parts of the disc, knowing that certain data would often be streamed from disc together, so that seek times were minimized.

But those game devs knew where everything was spatially on the disc, and how the data would generally be used during gameplay. It was consistent.

Do engineers have a lot of insight into how models get loaded spatially onto a given GPU at run time? Is this constant? Is it variable on a per GPU basis? I would think it would have to be.

Hard to optimize for this.


jaek 544 days ago

This brings to mind The Story of Mel from programming folklore.

http://beza1e1.tuxen.de/lore/story_of_mel.html


abrookewood 544 days ago

Such a good read - some people really are on another level in their chosen field.


vlovich123 543 days ago

Right now models have no structure, so that access is random, but you definitely know where the data is located in memory since you put it there. The physical location doesn’t matter, since it all goes through a TLB, but if you ask the GPU for a contiguous memory allocation it gives it to you. This is probably the absolute easiest thing to optimize for if your data access pattern is amenable to it.
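
To make the locality point concrete, here is a rough sketch, not a benchmark and not how any real GPU or cache behaves. It assumes a toy cache that remembers only the most recently touched line, and the name `line_switches` and the 16-elements-per-line figure are made up for illustration. It counts how often consecutive reads of a contiguous buffer land on a different line, for an in-order walk versus a shuffled one.

```python
import random


def line_switches(indices, elems_per_line=16):
    """Count how often consecutive accesses land on a different cache line.

    Toy model (an assumption, not real hardware): the cache holds only the
    most recently touched line, so every change of line counts as a new load.
    """
    switches = 0
    last_line = None
    for i in indices:
        line = i // elems_per_line  # which line this element lives on
        if line != last_line:
            switches += 1
            last_line = line
    return switches


n = 1 << 16  # number of elements in the contiguous buffer

# Access pattern 1: walk the buffer in order.
sequential = list(range(n))

# Access pattern 2: touch the same elements in a shuffled order.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = sequential[:]
rng.shuffle(shuffled)

seq_switches = line_switches(sequential)
rand_switches = line_switches(shuffled)

print("sequential line loads:", seq_switches)
print("shuffled line loads:  ", rand_switches)
```

In this model the in-order walk loads each line exactly once (`n // 16` loads), while the shuffled walk loads a new line on almost every access, even though both read the same contiguous allocation.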


harles 544 days ago

That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.


vlovich123 543 days ago

Haven’t read the paper, but my guess is that it’s for the same reason that sparse attention networks (where they zero out many weights) just have the sparse tensors be larger.


mayukhdeb 543 days ago

In this paper, we don't zero out the weights. We remove them.
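
For readers following along, the general difference between zeroing and removing can be sketched as below. This is only an illustration of the distinction, not the paper's method: the tiny weight matrix, the `keep` mask, and the helper names are all invented for the example. Zeroing keeps every slot in the dense tensor, while removing stores only the surviving connections, so the stored parameter count actually drops.

```python
def dense_forward(weights, x):
    """Plain dense matrix-vector product over a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sparse_forward(entries, n_out, x):
    """Matrix-vector product over only the stored (row, col, weight) entries."""
    y = [0.0] * n_out
    for r, c, w in entries:
        y[r] += w * x[c]
    return y


# Hypothetical 2x3 layer and a hypothetical mask of connections to keep.
weights = [
    [0.5, -1.0, 2.0],
    [1.5, 0.25, -0.5],
]
keep = [
    [True, False, True],
    [False, True, False],
]

# Option 1: zero out. The tensor keeps its shape; dropped weights become 0.0
# but still occupy a slot and still count as stored parameters.
zeroed = [
    [w if k else 0.0 for w, k in zip(w_row, k_row)]
    for w_row, k_row in zip(weights, keep)
]
zeroed_param_count = sum(len(row) for row in zeroed)

# Option 2: remove. Only surviving connections are stored at all.
removed = [
    (r, c, w)
    for r, row in enumerate(weights)
    for c, w in enumerate(row)
    if keep[r][c]
]
removed_param_count = len(removed)

x = [1.0, 2.0, 3.0]
print(dense_forward(zeroed, x), zeroed_param_count)
print(sparse_forward(removed, len(weights), x), removed_param_count)
```

Both versions compute the same output for this input, but the zeroed layer still stores 6 values while the removed one stores 3.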


vlovich123 541 days ago

Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?
