| HN Mirror


	by blackeyeblitzar 485 days ago
	From what I read elsewhere, this is the typical kind of performance optimization for matrix math you would see when performance is critical. It just hasn't been applied to this specific problem by other AI players yet, since it wasn't a necessity for other companies. But eventually everyone would probably end up here regardless.

3 comments

mitthrowaway2 485 days ago

How many people does it take to implement this? A 10% gain in performance could pay for a lot of people's salaries when your company is spending hundreds of millions on GPU clusters.

fulafel 485 days ago

If you think about how many people looked and failed to find this optimization in the community's earlier performance efforts, you could argue for quite a big number.

rfoo 485 days ago

Uh, three? I worked at $CORP, where we had a three-person sub-team. They reverse engineered most of Volta's SASS instruction encoding and built a working SASS assembler (before the open source one, of course), with the ultimate goal of making GEMM / Conv faster. And they did it. It wasn't applied to anything high-profile enough, though, so we never heard about it :>

If you don't believe me: previous open source SASS assemblers mostly came from universities, and they surely didn't have that many people.

bjourne 485 days ago

Did $CORP also release the implementation to make it trivial for others to replicate their work?

rfoo 485 days ago

I think we did release some of the optimized kernels, but I don't think we released any with SASS black magic, at least not before I left. Already been sanctioned by BIS, better not annoy NVIDIA any further.

DannyBee 485 days ago

Actually, a number of them did. Even Google did.

saagarjha 485 days ago

I mean it’s not a significant change so one? But that isn’t to say anyone could do it.

rvz 485 days ago

Just a reminder: this is the third of many open source releases DeepSeek is willing to put out, and finding optimizations like this when they are needed is a very low bar for them.

I guess since the majority here are blown away by the very low-level code involved, it tells me that they're likely not ready to use it or have been stuck on very high level tools that abstract this away.

randomNumber7 485 days ago

I'll tell you a secret. Most devs do something wrong when they start rolling their own linear algebra library. That's why people use LAPACK, BLAS, etc.

KeplerBoy 485 days ago

The thing is, most people don't use LAPACK or BLAS. Most people are at higher levels of abstraction than torch.matmul.

rowanG077 485 days ago

Just a few highly skilled people.

Bimos 485 days ago

I think most AI players rely on high performance GEMM. But most people would be satisfied with CUTLASS or cuBLAS, and the others implement GEMM themselves, but don't necessarily use undocumented features?

creato 485 days ago

Using undocumented features is not rare. People reverse engineered Apple's undocumented AMX instructions on their CPU, and I know people use undocumented/private extensions for several different kinds of GPUs.

Zacharias030 485 days ago

I’ve only seen it done by hedge funds so far. What were you referring to?
