Hacker News new | ask | show | jobs
by zgao 2 hours ago
Chip CEO here. It really depends on what "design" or "production" means. Does "design" mean that the design was complete? Does "production" mean the beginning of production, i.e. tapeout? If measuring from RTL-freeze to tapeout, this is a fairly typical (even somewhat unimpressive) timeline (accounting for some unexpected issues) for a large, complex 3nm chip. If measuring from concept (no RTL at all, block diagram of architecture) to tapeout, this is an amazing timeline. The truth is probably somewhere in between. A more concrete statement would use actual technical milestones and gates.
2 comments

Not a chip CEO, but I read this article and thought that they're working on some kind of application specific chip only for serving models. Similar to how an FPGA can optimize certain tasks.

Given constant weights / biases of a Transformer / DNN you could use pipelining to feed forward calculations through the array one layer at a time. For DNN's with thousands of layers you might see 1:1 speed up per layer channel.

I doubt they would undergo this process for marginal gains.

i don't understand what the second paragraph is saying.
In very crude terms, AFAICT, if you have a bunch of matrix multiplications, but one of matrices (the one with model weights) doesn't change, you can seriously speed up the computation. One thing is that you don't need to re-fetch the elements of the constant matrix, you can keep it near the ALUs. Then you maybe can detect and ignore sparse / empty blocks by marking them once.

IDK how the custom hardware exploits this; would love to hear any ideas!

Basically getting around the branch predictor problem with generalized compute architectures https://en.wikipedia.org/wiki/Branch_predictor
>If measuring from RTL-freeze to tapeout, this is a fairly typical (even somewhat unimpressive) timeline (accounting for some unexpected issues) for a large, complex 3nm chip.

Even for a company’s first design?

I don't think you get the newcomer novelty buff when your val approaches 13 digits.
This isn't Broadcom's first design.
Yeah, "first chip" here likely means they contracted Broadcom (or a firm with similar experience) to do all the heavy lifting. Building out your own in-house teams for this sort of thing is a decade-long project - just look how much inside Apple's early chips was licensed ARM / PowerVR cores
Apple didn't have the talent in-house until they bought Intrincity who worked with Samsung on Apple's earlier Arm chips as well. https://en.wikipedia.org/wiki/Intrinsity