by sdrg822 940 days ago
Cool. Important note:

""" One may ask whether the conditionality introduced by the use of CMM does not make FFFs incompatible with the processes and hardware already in place for dense matrix multiplication and deep learning more broadly. In short, the answer is "No, it does not, save for some increased caching complexity." """

It's hard to beat the hardware lottery!


In fact, as stated in the paper, this is bad news:

> We therefore leave the attention layers untouched

Meaning, presumably, that the GPU memory remains the bottleneck

Flops really are quite cheap by now, e.g. vision inference chip ~$2/teraflop/s !!

Memory is a bottleneck for larger models; however, this would presumably allow for cheaper models at scale or on compute-constrained devices (like phones).
And potentially for distributing a model across several devices at inference time. You could devote a cluster of smaller/weaker machines to inference.
You can do that today, though the only advantage is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might be faster?
> Flops really are quite cheap by now, e.g. vision inference chip ~$2/teraflop/s !!

I'm really interested, can you share where you got these numbers?

Axelera [1] or Hailo [2] give you 100-200 tflops for ~$200.

8-bit ops, inference only, low memory embedded, excluding the host, implied utilization from FPS specs is ~20%

But the trend is there.

There are also newer ADAS/AV units from China which claim 1000 tflops and can't really cost more than $1000-$2000 per car.

These are all tiled designs (see also Dojo/Tesla), heavily overweighted on flops vs. memory.

[1] https://www.axelera.ai/

[2] https://hailo.ai/
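
For anyone who wants to redo the arithmetic, here is a minimal Python sketch. The ~$200 price, the 100-200 peak tera-ops, and the ~20% utilization are the figures quoted in this comment, not verified vendor specs, and the function name `dollars_per_tflops` is made up for illustration.

```python
def dollars_per_tflops(price_usd, peak_tflops, utilization=1.0):
    """Purchase price per tera-op/s of sustained throughput.

    utilization is the fraction of peak actually achieved (assumed, 0..1).
    """
    return price_usd / (peak_tflops * utilization)


# Figures as quoted in the thread (assumptions, not datasheet values).
PRICE_USD = 200
for peak in (100, 200):
    nominal = dollars_per_tflops(PRICE_USD, peak)
    effective = dollars_per_tflops(PRICE_USD, peak, utilization=0.20)
    print(f"{peak} peak: ${nominal:.2f} nominal, ${effective:.2f} at 20% utilization")
```

With those inputs the nominal cost is $1-2 per tera-op/s, and roughly $5-10 once the ~20% implied utilization is factored in. Note these are 8-bit ops, so the comparison with floating-point hardware is loose.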

You can't get flops on a Hailo-8, they're fixed-point only. As much as these specialised inference chips are cool, we're a long way from just being able to drop them in where a GPU was. Not to mention the memory is hugely constrained. The Hailo chips I've worked with were all limited to 20MiB for the weights which is a squeeze even at 4-bit.
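
To put the 20MiB limit in perspective, a quick back-of-the-envelope sketch in Python. It assumes the whole 20MiB budget goes to weights, with no overhead for quantization scales, activations, or padding, so the real ceiling would be lower. `max_params` is a hypothetical helper, not part of any vendor SDK.

```python
MIB = 1024 ** 2  # bytes in one MiB


def max_params(weight_bytes, bits_per_weight):
    """Upper bound on parameter count that fits in weight_bytes."""
    return (weight_bytes * 8) // bits_per_weight


budget = 20 * MIB  # the limit mentioned above
for bits in (8, 4):
    print(f"{bits}-bit: {max_params(budget, bits):,} parameters")
```

That works out to about 21M parameters at 8-bit and about 42M at 4-bit, which is orders of magnitude below current language models.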
There's another paper replacing attention with FF networks so just combine the two and you've got something.
Link? Sounds like a good read! :)
Not op but might be this: https://arxiv.org/pdf/2311.10642.pdf
> ~$2/teraflop/s

H100 is basically ~$2/hour for 2000 tflops, or $1 for ~4*10^18 floating point operations.
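
The conversion can be checked with a short Python sketch. The ~$2/hour rental price and the 2000 tflops peak are the commenter's round numbers, taken here as assumptions. The result also assumes 100% utilization, which real workloads will not reach.

```python
SECONDS_PER_HOUR = 3600


def flop_per_dollar(peak_tflops, usd_per_hour):
    """Floating point operations bought per dollar at full utilization."""
    flop_per_hour = peak_tflops * 1e12 * SECONDS_PER_HOUR
    return flop_per_hour / usd_per_hour


per_dollar = flop_per_dollar(peak_tflops=2000, usd_per_hour=2.0)
print(f"{per_dollar:.1e} FLOP per dollar")
```

This gives 3.6*10^18 operations per dollar, which the comment rounds to 4*10^18. Keep in mind this is a rental rate per hour, so it is not directly comparable to the ~$2/teraflop/s purchase-price figure quoted above.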