Hacker News
by agentifysh 55 days ago
Until there is some drastic new hardware, we are going to see a situation similar to proof of work, where a small group hoards the hardware and can collude on prices.

The difference is that current prices are heavily subsidized by OPM (other people's money).

Once the narrative changes to something more realistic, I can see prices increasing across the board. Forget $200/month for codex pro; expect $1000/month or something similar.

So it's a race between new hardware supply, driven by paradigm shifts that can actually reach the market, and the tide going out in the financial markets.

> Until there is some drastic new hardware

For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.

During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.

New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.

There are also algorithmic improvements like the recently announced Google TurboQuant.

Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.

> Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

Isn't reading from flash significantly more power intensive than reading DRAM? Anyway, the overhead of keeping weights in memory becomes negligible at scale because you're running large batches and sharding a single model over a large number of GPUs. (And that needs the crazy fast networking to make it work; you get too much latency otherwise.)

For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

> becomes negligible at scale

Nothing is negligible at scale! Both the cost and the power draw of HBM are limiting factors for the hyperscalers, to the point that Sam Altman (famously!) cornered the market and locked in something like 40% of global RAM production, driving up prices for everyone.

> sharding a single model over large amounts of GPUs

A single host server typically has 4-16 GPUs directly connected to the motherboard.

Part of the reason for sharding models across multiple GPUs is that their weights don't fit into the memory of any one card! HBF could be used to give each GPU/TPU well over a terabyte of capacity for weights.
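
To put rough numbers on the "weights don't fit on one card" point, here is a back-of-the-envelope sketch. The model size (1,000B parameters at 8 bits each) and the 80 GB per-card capacity are hypothetical values picked for illustration, not figures from this thread; the 1,024 GB case stands in for the "well over a terabyte" HBF scenario. It counts weights only and ignores KV cache, activations, and runtime overhead.

```python
import math

def weight_footprint_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

def min_cards_for_weights(params_billions: float, bits_per_param: int,
                          card_memory_gb: float) -> int:
    """Smallest number of cards whose combined memory holds the weights.

    Ignores KV cache, activations, and runtime overhead, so real
    deployments would need more headroom than this.
    """
    footprint = weight_footprint_gb(params_billions, bits_per_param)
    return math.ceil(footprint / card_memory_gb)

# Hypothetical model: 1,000B parameters stored at 8 bits per parameter.
size_gb = weight_footprint_gb(1000, 8)
cards_hbm = min_cards_for_weights(1000, 8, 80)      # assumed 80 GB card
cards_hbf = min_cards_for_weights(1000, 8, 1024)    # assumed 1 TB of HBF per card
print(f"{size_gb:.0f} GB of weights: {cards_hbm} cards at 80 GB, {cards_hbf} at 1 TB")
```

Under those assumptions the capacity-driven part of the sharding goes away with terabyte-class memory per card, though sharding for throughput or latency would still be a separate decision.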

Last but not least, the context cache needs to be stored somewhere "close" to the GPUs. Across millions of users, that's a lot of unique data with a high churn rate. HBF would allow the GPUs to keep that "warm" and ready to go for the next prompt at a much lower cost than keeping it around in DRAM and having to constantly refresh it.

> For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

Being non-volatile, Flash has no idle power (whereas DRAM needs refresh), but the active power for reading a fixed-size block is significantly higher for Flash. You can still use Flash profitably, but only for rather sparse and/or low-intensity reads. That probably fits things like MoE layers if the MoE is sparse enough.

Also, you can't really use flash memory (especially soldered-in HBF) for ephemeral data like the KV context for a single inference; it wears out way too quickly.

Modern flash memory, with multi-bit cells, indeed requires more power for reading than DRAM, for the same amount of data.

However, for old-style 1-bit per cell flash memory I do not see any reason for differences in power consumption for reading.

Different array designs, sense amplifier designs, and CMOS fabrication processes can result in different power consumption, but similar techniques can be applied to both kinds of memory to reduce it.

Of course, storing only 1 bit per cell instead of 3 or 4 greatly reduces the density and cost advantages of flash memory, but what remains may still be enough for what inference needs.

The basic physics of reading from Flash vs. DRAM are broadly similar, and it's true that reading from SLC flash is a bit cheaper, but you'll still need much higher voltages and longer read times to read from flash compared to DRAM. It's not really the same.

Doubtful. Local models are the competitive future that will keep prices down.

128GB is all you need.

A few more generations of hardware and open models will find people pretty happy doing whatever they need to on their laptop locally, with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers at the $1000/month per seat needed to sustain the enormous datacenter models.

Apple will win this battle and NVIDIA will be second when their goals shift to workstations instead of servers.

Weird how you're leaving stuff like Strix Halo out. Also weird that you think 128GB is the future, given all the research and papers out now that target something around 12GB. I assume we'll end up with fewer general-purpose models and more small, specific ones swapped in for whatever work you're asking for.

Strix Halo hasn't got nearly enough bandwidth, it's just 256-bit.

It's sufficient for some MoE models.

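
A rough way to see both sides of the bandwidth argument: single-stream decoding is usually memory-bound, so an upper limit on tokens per second is memory bandwidth divided by the bytes of weights read per token. The sketch below assumes a transfer rate of 8000 MT/s on the 256-bit bus (the thread only gives the bus width) and uses hypothetical model sizes: a dense 70B model and an MoE with 3B active parameters, both at 4 bits per parameter. It ignores KV cache reads and compute limits, so real numbers will be lower.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000

def decode_tokens_per_s_ceiling(bandwidth_gb_s: float,
                                active_params_billions: float,
                                bits_per_param: int) -> float:
    """Upper bound on single-stream decode speed when memory-bound.

    Assumes every active parameter is read once per generated token.
    """
    gb_read_per_token = active_params_billions * bits_per_param / 8
    return bandwidth_gb_s / gb_read_per_token

# 256-bit bus is from the thread; 8000 MT/s is an assumption.
bw = peak_bandwidth_gb_s(256, 8000)
dense = decode_tokens_per_s_ceiling(bw, 70, 4)  # hypothetical dense 70B model
moe = decode_tokens_per_s_ceiling(bw, 3, 4)     # hypothetical MoE, 3B active params
print(f"{bw:.0f} GB/s -> dense: {dense:.1f} tok/s, MoE: {moe:.1f} tok/s (ceilings)")
```

With those assumptions the dense model tops out in the single digits of tokens per second, while the sparse MoE has a ceiling well above a hundred, which is consistent with "not enough bandwidth" and "sufficient for some MoE models" both being true.
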
> 128GB is all you need.

My guy, look around.

They are coming for personal compute.

Where are you going to get these 128GBs? Aquaman? [0]

The ones who make RAM are inexplicably attaching their fate to the future being all LLMs only everywhere.

[0] https://www.youtube.com/watch?v=0-w-pdqwiBw

Cloud can’t make money off of you and pay more than you for the hardware at the same time.

Batch inference is much more efficient. Using the hardware round the clock is much more efficient. Cloud can absolutely pay more for hardware and still make money off you.

Cloud can pay more for RAM until all the RAM producers withdraw from the consumer market, then prices will go back down.

End users will still get access to RAM. The cloud terminal they purchase from Apple, Google, Samsung, or HP will have all the RAM it will ever need directly soldered onto it.

Doesn’t Apple place RAM directly into the SoC package? We aren’t even talking about soldering it to motherboards anymore; it comes in with the CPU, like it would with a GPU.

I was really fucking hoping we weren't at the part where "cloud terminals" doesn't seem farfetched and paranoid and yet here we are. Jesus Christ.

The next step, I think, will be a "cash for clunkers" program to permit people to trade in old computer hardware to the government—especially since operating systems that do not collect KYC data on their users will soon be illegal to operate.

RAM upgrades are happening because of DDR5. NVMe upgrades are happening because of PCIe 5. Prices will come down once everyone is done upgrading.

The hourly cost problem is worse for agents than for single-model calls because context accumulates across steps. Each tool result re-bills everything before it. Rate limits are a ceiling, but the quadratic curve hits you before the ceiling does. We built Traeco to surface that curve at config time, not billing time. traeco.dev

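
The quadratic curve mentioned above falls out of simple arithmetic. Here is a minimal sketch, assuming every step resends the full history as input with no prompt caching or discounting. The function names and the token counts (2,000-token starting prompt, 1,000 tokens added per tool result) are hypothetical and only there to show the shape of the curve.

```python
def total_billed_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens billed across an agent run.

    Assumes each step sends the whole accumulated context again and
    that each step appends a fixed number of tokens to it.
    """
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # the full context is billed as input
        context += tokens_per_step  # this step's tool result joins the history
    return total

def closed_form(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Same total via the arithmetic-series formula; note the steps**2 term."""
    return steps * base_tokens + tokens_per_step * steps * (steps - 1) // 2

# Hypothetical run: 2,000-token prompt, 1,000 tokens added per tool call.
ten = total_billed_input_tokens(10, 2000, 1000)
twenty = total_billed_input_tokens(20, 2000, 1000)
print(ten, twenty, round(twenty / ten, 2))
```

Doubling the number of steps from 10 to 20 roughly triples and a half the billed input under these assumptions, rather than doubling it. Prompt caching or context trimming would flatten the curve, which this sketch deliberately leaves out.
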
More like RAM producers are providing supplies to the highest bidder, no? If this doesn't peter out, supply will eventually normalize at a higher but less insane price.