by lukecameron 1213 days ago
> it may be possible to achieve a 100× energy-efficiency advantage

Running the math on a machine with 8x A100 (enough to run today's LLMs), that would be 300 W * 8 GPUs / 100 = 24 W.
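
For anyone who wants to play with the numbers, here is a minimal sketch of that back-of-the-envelope estimate. The 300 W per GPU, the 8 GPUs, and the 100x gain are the figures from the comment and the abstract; the function name is just a placeholder.

```python
def scaled_power_watts(gpu_watts, gpu_count, efficiency_gain):
    """Power needed if the whole machine became `efficiency_gain` times more efficient."""
    return gpu_watts * gpu_count / efficiency_gain

# Figures from the comment: 8 GPUs at roughly 300 W each.
baseline_watts = scaled_power_watts(300, 8, 1)    # today's electronic machine
optical_watts = scaled_power_watts(300, 8, 100)   # with the hoped-for 100x advantage

print(f"baseline: {baseline_watts:.0f} W, with 100x gain: {optical_watts:.0f} W")
```

This ignores everything except GPU power (host CPU, memory, cooling), so treat it as a rough lower bound rather than a device budget.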

This is within striking distance of IoT and personal devices. I'm trying to imagine what a world would look like where generative text models are commoditised to the point where you can either generate text locally on your phone, or generate GBs of text in the cloud.

I have to admit it's very hard to make any sort of accurate prediction.

6 comments

Nope, they're just going to make the model 100-8000x bigger, then try to double and then quadruple the width of the optical transformer.
> 100× energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a >8,000×

Maybe I interpreted that incorrectly but I thought it's saying a 100x advantage for current large Transformer models, and 8000x advantage for future quadrillion-parameter models? I didn't include those because I suppose that size of model is quite a few years away. Admittedly this is only based on the abstract...

Need to compare this with custom silicon like Apple will be shipping. They already have the Neural Engine chip which can run Stable Diffusion, but eventually you could imagine casting a specific model instance to an ASIC (say GPT-3.5 or -4, today).

If most devices are replaced within a year or two then you get a pretty good cadence for updating your Siri model (and even more incentive for users to upgrade hardware).

You don't, because of the scaling law they say they've identified. If optical energy per MAC operation scales as 1/d, we know two things: 1) there is no electronic architecture possible that can catch it, and 2) bigger models give optical networks a bigger energy advantage.

It's possible to have a temporary lead because of constant factors, but as long as an electronic circuit has to expend a unit of energy per MAC, you'll always be able to specify a model big enough that an optical network will beat it.
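
A toy model of that argument, as a sketch only: assume optical energy per MAC is c_opt / d (taking d to be the model width, which the comment does not define) and electronic energy per MAC is a constant e_elec. The constants below are made up purely for illustration and are not from the paper.

```python
def optical_energy_per_mac(d, c_opt):
    """Assumed 1/d scaling law: energy per MAC falls as the width d grows."""
    return c_opt / d

def crossover_width(c_opt, e_elec, electronic_speedup=1):
    """Width d above which optical beats electronic.

    Optical wins when c_opt / d < e_elec / electronic_speedup,
    i.e. when d > c_opt * electronic_speedup / e_elec.
    """
    return c_opt * electronic_speedup / e_elec

# Hypothetical constants in arbitrary energy units (illustration only).
C_OPT = 1000.0
E_ELEC = 1.0

d_star = crossover_width(C_OPT, E_ELEC)
d_star_asic = crossover_width(C_OPT, E_ELEC, electronic_speedup=100)

print(f"crossover without ASIC gains: d > {d_star:.0f}")
print(f"crossover with a 100x better ASIC: d > {d_star_asic:.0f}")
```

Under these assumptions a better electronic constant factor only pushes the crossover to a larger d; it never removes it. Whether the 1/d law holds in real hardware is exactly what is in question.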

1) this is a research device and a theoretical scaling law; it’s not been proven.

> We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a 100× energy-efficiency advantage

Emphasis on may.

2) in the real world, constant factors matter (as you allude to). For example if an ASIC gets a 1000x speedup (optimistic; we saw this for BTC) it might be the better choice for this generation, but start to lose next gen and beyond. If an ASIC only gets 100x or lower then it’s not favorable this gen.

So sure, this tech might win in the long term, but I wasn’t making any categorical claims, just noting that there are multiple horses we need to track.

It would be quite foolish to dismiss custom silicon solutions based on this paper.

Stable Diffusion on the M1 runs on the GPU, not the ANE
You can run it on either now (for example, MochiDiffusion allows you to pick https://github.com/godly-devotion/MochiDiffusion#compute-uni...). Anecdotally, the GPU seems to be faster on an M1 Max or higher, while the ANE is a touch faster on anything smaller and more power efficient in general.
I think https://machinelearning.apple.com/research/stable-diffusion-... describes running it on the Neural Engine.
Low power LLM-chip :) Can't wait for that
It will also be at least 100x (physically) larger, because the optical wavelength is ~1000 nm vs. the ~10 nm electronic gate size. So much for personal devices.
True but I think the difference is smaller than one might expect from pure element size.

IIRC one reason we don't already have fully 3D chips is because of the heat dissipation. Reducing 2400 W to 24 W means the heat is much more tractable, which means it can be closer to volumetric than planar.

Consider a 1 cm × 1 cm × 1 mm chip: with (1 μm)^3 elements filling the volume, that's 1e11 per chip; with (10 nm)^3 elements limited to one layer because of heat, 1e12.

Yes, this is still a factor of 10, and real chips have a few layers because heat is a problem but not a total blocker, but it's still much less than the 100^3 ratio a simple scale-up would suggest.
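
The element counts above can be checked with a few lines, as a rough sketch. It assumes perfectly packed cubic elements and uses integer nanometres to avoid floating-point noise; the variable names are my own.

```python
# Chip dimensions from the comment, in nanometres: 1 cm x 1 cm x 1 mm.
CHIP_W_NM = 10_000_000
CHIP_D_NM = 10_000_000
CHIP_H_NM = 1_000_000

OPTICAL_PITCH_NM = 1_000    # ~1 micron optical element
ELECTRONIC_PITCH_NM = 10    # ~10 nm electronic element

# Optical: elements fill the whole volume (heat is assumed tractable).
optical_elements = (
    (CHIP_W_NM // OPTICAL_PITCH_NM)
    * (CHIP_D_NM // OPTICAL_PITCH_NM)
    * (CHIP_H_NM // OPTICAL_PITCH_NM)
)

# Electronic: a single planar layer because of heat.
electronic_elements = (CHIP_W_NM // ELECTRONIC_PITCH_NM) * (CHIP_D_NM // ELECTRONIC_PITCH_NM)

# Naive ratio if both were packed volumetrically at their own pitch.
naive_ratio = (OPTICAL_PITCH_NM // ELECTRONIC_PITCH_NM) ** 3

print(f"optical (3D):      {optical_elements:.0e}")
print(f"electronic (1 layer): {electronic_elements:.0e}")
print(f"density gap: {electronic_elements // optical_elements}x vs naive {naive_ratio:.0e}x")
```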

Maybe, but then we need to make sure light does not seep into neighboring cells, will need metal shields for that, and then heat will dissipate ... maybe they could solve this in the future.
Physics noob here. But wouldn't it be possible for light to travel through a tube smaller than its wavelength? Here is something I found via a Google search: https://www.quora.com/Can-a-given-color-of-light-pass-throug...
What will happen is that such a thin tube will not be able to confine the electromagnetic wave within its boundary, so most of the wave's field will propagate outside of the tube and will quickly diffract on any bumps it encounters.
> I'm trying to imagine what a world would look like where generative text models are commoditised to the point where you can either generate text locally on your phone, or generate GBs of text in the cloud.

Dead internet theory for one. Scalable spear phishing and scams. Scalable automated offensive hacking. SEO far worse than anything possible today. Mass manipulation campaigns.

Social interaction would also be strange. Every messenger and dating app able to automatically reply and suggest sophisticated messages.

You still have to store and move the model bits. I/O is gonna be a problem
I/O is already meeting those performance levels on today's technology.

You can get a hundred dollar SSD that has a read speed around 7 gigabytes per second using just over five watts. That will fill up 8x80GB in a minute and a half of load time. If your energy budget is 24 watts then install four and make it 20-25 seconds.

As for cost, I don't know what the proposed chip would cost, but $400 is about 0.5% of the price of that pile of GPUs, and SSDs will only get cheaper.
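
The load-time estimate is easy to reproduce. A minimal sketch, using the comment's figures (8 x 80 GB of weights, about 7 GB/s per SSD) and assuming reads scale linearly across drives, which real systems may not quite achieve:

```python
def load_time_seconds(model_gb, ssd_read_gb_per_s, ssd_count=1):
    """Time to stream the model from SSDs, assuming perfectly parallel reads."""
    return model_gb / (ssd_read_gb_per_s * ssd_count)

MODEL_GB = 8 * 80          # memory of 8x 80 GB GPUs
SSD_READ_GB_PER_S = 7.0    # sequential read speed quoted in the comment

one_ssd = load_time_seconds(MODEL_GB, SSD_READ_GB_PER_S)
four_ssds = load_time_seconds(MODEL_GB, SSD_READ_GB_PER_S, ssd_count=4)

print(f"one SSD:   {one_ssd:.0f} s")
print(f"four SSDs: {four_ssds:.0f} s")
```

Both results land in the ranges quoted above: about a minute and a half for one drive, and 20-25 seconds for four.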

Photonics can be great for I/O, nb.
move to denser forms of communication i.e. math notation
can you elaborate on what you mean?