| HN Mirror

by algo_trader 940 days ago

In fact, as stated in the paper, this is bad news:

> We therefore leave the attention layers untouched

Meaning, presumably, that the GPU memory remains the bottleneck

Flops really are quite cheap by now, e.g. vision inference chip ~$2/teraflop/s !!

4 comments

marcinzm 940 days ago

It's a bottleneck for larger models; however, this would presumably allow for cheaper models at scale or on compute-constrained devices (like phones).

entropicdrifter 940 days ago

And potentially for distributing a model across several devices at inference time. You could devote a cluster of smaller/weaker machines to inference.

sroussey 940 days ago

You can do that today; the only advantage today, though, is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might be faster?

ashirviskas 940 days ago

>Flops really are quite cheap by now, e.g. vision inference chip ~$2/teraflop/s !!

I'm really interested, can you share where you got these numbers?

algo_trader 940 days ago

Axelera [1] or Hailo [2] give you 100-200tflop for ~$200.

Caveats: 8-bit ops, inference only, low-memory embedded, excluding the host; implied utilization from FPS specs is ~20%.
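To make the arithmetic behind those figures explicit, here is a rough sketch in Python. The inputs are just the numbers quoted above (~$200, 100-200 tera-ops/s peak, ~20% utilization); the function name is made up for illustration, and nothing here comes from a vendor datasheet.

```python
def dollars_per_tops(price_usd: float, peak_tops: float, utilization: float = 1.0) -> float:
    """Purchase price divided by sustained tera-ops/s (hypothetical helper)."""
    return price_usd / (peak_tops * utilization)


PRICE_USD = 200.0    # approximate price quoted in the comment
UTILIZATION = 0.20   # implied utilization quoted in the comment

for peak in (100.0, 200.0):
    nominal = dollars_per_tops(PRICE_USD, peak)
    effective = dollars_per_tops(PRICE_USD, peak, UTILIZATION)
    print(f"{peak:.0f} T peak: ${nominal:.2f} nominal, ${effective:.2f} at 20% utilization")
```

So the ~$1-2 per tera-op/s headline figure becomes roughly $5-10 per sustained tera-op/s once the ~20% utilization is taken into account, still excluding the host.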

But the trend is there.

There are also newer ADAS/AV units from China which claim 1000tflops and can't really cost more than $1000-$2000 per car.

These are all tiled designs (see also Tesla's Dojo), heavily overweighted on flops vs memory.

[1] https://www.axelera.ai/

[2] https://hailo.ai/

Y_Y 940 days ago

You can't get flops on a Hailo-8, they're fixed-point only. As much as these specialised inference chips are cool, we're a long way from just being able to drop them in where a GPU was. Not to mention the memory is hugely constrained. The Hailo chips I've worked with were all limited to 20MiB for the weights which is a squeeze even at 4-bit.
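For a sense of how tight 20MiB is, a back-of-the-envelope sketch in Python. It assumes the whole budget goes to packed weights and ignores quantization scales, zero-points, and activations, so real limits would be lower; the helper name is hypothetical.

```python
MIB = 2**20  # bytes per mebibyte


def max_params(weight_bytes: int, bits_per_weight: int) -> int:
    """Upper bound on parameter count for a given weight budget (hypothetical helper)."""
    return (weight_bytes * 8) // bits_per_weight


budget = 20 * MIB  # the 20MiB weight limit mentioned above

for bits in (8, 4):
    print(f"{bits}-bit: at most {max_params(budget, bits):,} parameters")
```

Even at 4-bit that is about 42 million parameters, orders of magnitude below a multi-billion-parameter language model.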

theGnuMe 940 days ago

There's another paper replacing attention with FF networks so just combine the two and you've got something.

gdoug 940 days ago

Link? Sounds like a good read! :)

smeeth 940 days ago

Not OP, but it might be this: https://arxiv.org/pdf/2311.10642.pdf

YetAnotherNick 940 days ago

> ~$2/teraflop/s

H100 is basically ~$2/hour for ~2000 tflop/s, or $1 for ~4*10^18 floating point operations.
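Spelling out that conversion as a quick Python sketch, using only the two round numbers above (~$2/hour, ~2000 tflop/s peak) and assuming full utilization; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def flop_per_dollar(peak_tflops: float, usd_per_hour: float) -> float:
    """Floating point operations bought per dollar at peak rate (hypothetical helper)."""
    flop_per_hour = peak_tflops * 1e12 * SECONDS_PER_HOUR
    return flop_per_hour / usd_per_hour


h100 = flop_per_dollar(peak_tflops=2000.0, usd_per_hour=2.0)
print(f"{h100:.1e} floating point operations per dollar")
```

That works out to 3.6*10^18 operations per dollar, which rounds to the ~4*10^18 figure.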
