by addaon 384 days ago
> SRAM scaling is dead

I'm /way/ outside my expertise here, so possibly-silly question. My understanding (any of which can be wrong, please correct me!) is that (a) the memory used for LLMs is dominantly parameters, which are read-only during inference; (b) SRAM scaling may be dead, but NVM scaling doesn't seem to be; (c) NVM read bandwidth scales well locally, within an order of magnitude or two of SRAM bandwidth, for wide reads; (d) although NVM isn't currently on leading-edge processes, market forces are generally pushing NVM to smaller and smaller processes for the usual cost/density/performance reasons.

Assuming that cluster of assumptions holds, does that suggest there's a time down the road where something like a chip-scale-integrated inference chip using NVM for parameter storage solves the problem?


The processes used for logic chips and the processes used for NVM are typically different. The only case I know of where the industry combines them on a single chip is Texas Instruments’ MSP430 microcontrollers with FeRAM, but the quantities of FeRAM there are incredibly small and the process technology is ancient. It seems unlikely to me that the rest of the industry will combine the processes such that you can have both on a single wafer, but you would have better luck asking a chip designer.

That said, NVM often has a wear-out problem. This is a major disincentive for using it in place of SRAM, which is frequently written. Different types of NVM have different endurance limits, but if such a chip were built, it would only be a matter of time before it stopped working.

> The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM

Every microcontroller with on-chip NVM would count. Down to 45 nm, this is mostly Flash, with the exception of the MSP430's FeRAM. Below that... we have TI pushing Flash, ST pushing PCM, NXP pushing MRAM, and Infineon pushing (TSMC's) RRAM. All on processes in the 22 nm (planar) range, either today or in the near future.

> This is a major disincentive for using it in place of SRAM, which is frequently written.

But isn't parameter memory written only once per model update, for silicon used for inference on a specific model? Even with daily writes, the typical 10k to 1M allowable writes for most of the technologies above would last decades.
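
As a back-of-the-envelope check of that claim, here is a minimal sketch. The function name and the one-write-per-day rate are illustrative assumptions; the endurance figures are just the 10k and 1M bounds quoted above.

```python
def years_until_wearout(endurance_cycles, writes_per_day=1.0):
    """Years until a cell reaches its rated write endurance at a fixed write rate."""
    return endurance_cycles / writes_per_day / 365.25

# Assumed scenario: the whole parameter memory is rewritten once per day.
for cycles in (10_000, 1_000_000):
    print(f"{cycles:>9,} cycles: ~{years_until_wearout(cycles):,.0f} years")
```

Under those assumptions the low end works out to roughly 27 years and the high end to roughly 2,700 years.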

I had been unaware of the others. Anyway, you need writes to the KV cache for every token generated. You are going to hit that endurance limit fast.
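
How fast depends heavily on how the cache is laid out. A rough sketch, assuming (hypothetically) a fixed pool of KV slots reused round-robin so that wear is spread evenly, with one slot write per generated token. The slot count and token rate below are made-up illustration values, not measurements of any real chip.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def kv_cache_lifetime_years(endurance_cycles, kv_slots, tokens_per_second):
    """Years until evenly worn KV slots reach their rated endurance.

    Assumes each generated token writes exactly one slot and that writes
    rotate across all slots (perfect wear levelling).
    """
    writes_per_slot_per_second = tokens_per_second / kv_slots
    return endurance_cycles / writes_per_slot_per_second / SECONDS_PER_YEAR

# Assumed scenario: 8192 slots, 1000 tokens/s aggregate across batched requests.
for cycles in (10_000, 1_000_000):
    years = kv_cache_lifetime_years(cycles, kv_slots=8192, tokens_per_second=1000)
    print(f"{cycles:>9,} cycles: ~{years:.3f} years")
```

With these assumed numbers the slots wear out in under a day at 10k cycles and in about three months at 1M cycles. A larger slot pool or a lower token rate stretches that proportionally.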