by tony_cannistra 806 days ago
Don't mean to single you out at all, but I find this comment to be a great example of how the "ML Hype" is perceived by a certain segment of folks in our industry.

The development of this chip shows that it doesn't (and shouldn't!) matter to the ML teams at Meta how 'fast ML is evolving.'

Indeed, what it demonstrates is that a huge, global, trillion-dollar business has operationalized an existing ML technology to the extent that they can invest in, and deploy, customized hardware for solving a business problem.

How ML "evolves" is irrelevant. They have a system which solves their problem, and they're investing in it.

Not to mention the capabilities they developed by actually creating this and what they'll be able to do next thanks to this experience.

You've gotta learn to walk before you can run

In their defense, it’s because the article is (understandably) sparse on details about what makes the requirements of their ranking models different from image classification or LLMs. Unless you work in industry it’s unlikely you will have heard of DeepFM or ESMM or whatever Meta is using.

And building out specialized hardware does lock you in to a certain extent. Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that.

> Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that.

Which is probably why Meta is also buying the biggest Nvidia datacenter cards by the shipload. There is no need to run inference for a small model - say for a text-ad recommendation system - on an H100 with attendant electricity and cooling costs.

Also, like, FP tensor cores are way more expensive than fixed-point tensor cores, and with some care, it's very much practical to even train DNNs on them.

E.g. it's common to have a full-width accumulator and, say, s16 gradients with u8 activations and s8 weights, with the FMA (MAC) chain of the tensor multiply operation post-scaled by a learned u32 factor plus a follow-up "learned" shift, which together effectively act as a fixed-point factor with a learned position of its point, to re-scale the outcome to the u8 activation range.
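
To make the rescaling step concrete, here is a minimal sketch in plain Python of a u8 x s8 dot product with a wide accumulator, followed by the multiply-then-shift requantization back to u8. The function names, the round-to-nearest term, and the saturation to [0, 255] are my own assumptions for illustration, not what any particular chip does; in a real setup the multiplier and shift would be learned rather than passed in by hand.

```python
def requantize(acc: int, multiplier: int, shift: int) -> int:
    """Rescale a wide accumulator to the u8 range using integer ops only.

    multiplier / 2**shift plays the role of a fixed-point scale factor;
    `shift` sets where the point sits. (Hypothetical helper, assumes shift >= 1.)
    """
    if shift < 1:
        raise ValueError("shift must be >= 1 in this sketch")
    # Add half of the divisor before shifting so the result rounds
    # instead of always truncating toward negative infinity.
    scaled = (acc * multiplier + (1 << (shift - 1))) >> shift
    # Saturate to the u8 activation range.
    return max(0, min(255, scaled))


def int_dot_u8_s8(activations, weights, multiplier: int, shift: int) -> int:
    """Dot product of u8 activations and s8 weights, requantized to u8."""
    assert all(0 <= a <= 255 for a in activations), "activations must be u8"
    assert all(-128 <= w <= 127 for w in weights), "weights must be s8"
    acc = 0  # Python ints are unbounded; hardware would use e.g. a 32-bit accumulator
    for a, w in zip(activations, weights):
        acc += a * w  # the MAC chain
    return requantize(acc, multiplier, shift)


if __name__ == "__main__":
    # Accumulator is 10 - 40 + 90 = 60; scale 3 / 2**2 = 0.75
    print(int_dot_u8_s8([10, 20, 30], [1, -2, 3], multiplier=3, shift=2))
```

Note that no floating-point operation appears anywhere on this path, which is the whole point of the cheaper fixed-point cores.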

By having the gradients be sufficiently wider, it's practical to use a straight-through estimator for backpropagation. I read a paper (kinda two, actually) a few months ago that dealt with this (IIRC one of them was more about the hardware/ASIC aspects of fixed-point tensor cores, the other more about model training experiments with existing low-precision integer-MAC chips, IIRC particularly with inference in mind). If requested, I can probably find it by digging through my system(s); I would have already linked it/them if the cursory search hadn't failed.
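
For anyone who hasn't run into the straight-through estimator: the rough idea is that the forward pass rounds to the quantization grid, while the backward pass pretends the rounding was the identity and passes the gradient through unchanged. A minimal sketch in plain Python follows. It uses floats for readability rather than s16 gradients, the function names and the `scale` parameter are hypothetical, and zeroing the gradient where the clamp saturates is one common variant that I'm assuming here, not something taken from the papers mentioned above.

```python
def fake_quantize(x: float, scale: float) -> float:
    """Forward pass: snap x onto a u8 grid with step `scale`, then map back."""
    q = max(0, min(255, round(x / scale)))  # round and saturate to u8
    return q * scale


def ste_grad(x: float, scale: float, upstream_grad: float) -> float:
    """Backward pass under the straight-through estimator.

    Rounding has zero gradient almost everywhere, so it is treated as the
    identity. Outside the representable range the clamp is saturated, and
    this sketch passes no gradient there.
    """
    q = x / scale
    if 0.0 <= q <= 255.0:
        return upstream_grad
    return 0.0


if __name__ == "__main__":
    x, scale = 0.26, 0.1
    print(fake_quantize(x, scale))  # x snapped to the nearest grid point
    print(ste_grad(x, scale, 2.0))  # gradient passes straight through
    print(ste_grad(-1.0, scale, 2.0))  # saturated below zero: no gradient
```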

To me, it’s bizarre to see the HPC mindset taking hold again after the cloud/commodity mindset dominated the last 16 years.

You don’t always need a Ferrari to go to the store

WDYM by HPC mindset?
"The only meaningful benchmark in the world is LAPACK and only larger than ever monolithic problem instances matter, I don't know what you're talking about, 'embarrassingly parallel'? What a silly word! Serving web requests concurrently? Good for you, congratulations, but can you do parallel programming?"

Sorry if this makes anyone feel bad. It certainly made me uncomfortable typing it out, though.

Roughly this. Part of it is performance fetish. Part of it is one architecture for every purpose. I can’t tell you how many times I’ve seen people run embarrassingly parallel jobs coordinated by MPI on a Cray - because somebody spent all that money on that machine. Don’t forget about Bell prize outages.