| HN Mirror


by otterdude 1 hour ago

Not a chip CEO, but I read this article and thought that they're working on some kind of application-specific chip only for serving models, similar to how an FPGA can be configured to optimize certain tasks.

Given the constant weights / biases of a Transformer / DNN, you could use pipelining to feed forward calculations through the array one layer at a time. For DNNs with thousands of layers you might see a 1:1 speedup per layer channel.
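To make the pipelining idea concrete, here is a minimal software sketch, not a description of any real chip. It assumes each layer is a fixed function sitting in its own pipeline stage and that every stage takes one clock tick; `run_pipeline` and `EMPTY` are hypothetical names made up for illustration, and at least one layer is assumed.

```python
EMPTY = object()  # marks a pipeline register that holds no value yet


def run_pipeline(layers, inputs):
    """Simulate a layer pipeline where stage i permanently holds layer i.

    On every tick each stage processes whatever the previous stage
    produced on the previous tick, so several inputs are in flight at
    once, each in a different layer. Returns (outputs, ticks).
    """
    regs = [EMPTY] * len(layers)  # output register of each stage
    pending = list(inputs)
    outputs, ticks = [], 0
    while len(outputs) < len(inputs):
        # Walk back to front so each stage reads last tick's value
        # before the upstream stage overwrites it.
        for i in range(len(layers) - 1, -1, -1):
            if i == 0:
                src = pending.pop(0) if pending else EMPTY
            else:
                src = regs[i - 1]
            regs[i] = layers[i](src) if src is not EMPTY else EMPTY
        ticks += 1
        if regs[-1] is not EMPTY:
            outputs.append(regs[-1])
    return outputs, ticks


# Three toy "layers" with fixed behavior, four inputs.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, ticks = run_pipeline(layers, [1, 2, 3, 4])
sequential_ticks = len(layers) * 4  # one input at a time, no overlap
```

Under these assumptions the pipeline finishes N inputs through L layers in L + N - 1 ticks instead of L * N, so once it is full one result comes out per tick regardless of depth.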

I doubt they would undergo this process for marginal gains.

1 comments

xdavidliu 1 hour ago

I don't understand what the second paragraph is saying.


nine_k 24 minutes ago

In very crude terms, AFAICT, if you have a bunch of matrix multiplications, but one of the matrices (the one with the model weights) doesn't change, you can seriously speed up the computation. For one thing, you don't need to re-fetch the elements of the constant matrix; you can keep it near the ALUs. You may also be able to detect sparse / empty blocks once, mark them, and skip them from then on.
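A rough software analogy of those two tricks, assuming a plain list-of-lists weight matrix and a small fixed block size: the weights are tiled once at load time, all-zero tiles are dropped at that point, and every later multiplication touches only the surviving tiles. `StationaryWeights` is a hypothetical name for this sketch, and real hardware would do this in silicon rather than in a loop.

```python
class StationaryWeights:
    """Holds a constant weight matrix, pre-tiled with empty tiles removed."""

    def __init__(self, W, block=2):
        self.rows, self.cols = len(W), len(W[0])
        self.blocks = []  # list of (row offset, col offset, tile)
        # One-time pass: cut W into block x block tiles and keep only
        # the tiles that contain at least one nonzero weight.
        for r0 in range(0, self.rows, block):
            for c0 in range(0, self.cols, block):
                tile = [row[c0:c0 + block] for row in W[r0:r0 + block]]
                if any(v != 0 for trow in tile for v in trow):
                    self.blocks.append((r0, c0, tile))

    def matvec(self, x):
        """Compute W @ x using only the stored nonzero tiles."""
        y = [0] * self.rows
        for r0, c0, tile in self.blocks:
            for dr, trow in enumerate(tile):
                for dc, w in enumerate(trow):
                    y[r0 + dr] += w * x[c0 + dc]
        return y


W = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
weights = StationaryWeights(W, block=2)  # sparsity scan happens once here
y = weights.matvec([1, 1, 1, 1])         # reuses the same tiles every call
```

Here two of the four tiles are empty, so each `matvec` call does half the multiply-adds of a dense pass, and the cost of finding the empty tiles is paid only once.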

IDK how the custom hardware exploits this; would love to hear any ideas!


otterdude 58 minutes ago

Basically, it's getting around the branch predictor problem that comes with generalized compute architectures: https://en.wikipedia.org/wiki/Branch_predictor
