Hacker News new | ask | show | jobs
by elijahbenizzy 500 days ago
We’ve just learned that it’s possible to do AI on less compute (deepseek). if OpenAI doesn’t scale and that’s the problem then I’d argue that in the long run, if you believe in their ability to do research, then the news this week is a very bullish sign.

IMO the equivalent of moores law for AI (both on software and hardware development) is baked into the price, which doesn’t make the valuation all too crazy.

5 comments

> We’ve just learned that it’s possible to do AI on less compute (deepseek).

There's a huge motte and bailey thing with DeepSeek conversation, where the bailey is "It only took $5.5 million!*" (* for exactly one training run for one of several models, at dirt-cheap per-hour spot prices for H100s) and the motte is all sorts of stuff.

Truth is one run for one model took 2048 GPUs fulltime for 2 months, and my experience with FAANG ML, that means it took 6 months part-time and another 1.5-2.5 runs went absolutely nowhere.

> is baked into the price, which doesn’t make the valuation all too crazy.

Valuations for most large companies have been crazy for a while now. No one values a company based on fundamentals anymore, its all pure gambling on future predictions.

This isn't unique to OpenAI by any means, but they are a good example. Last I checked their revenue to valuation multiplier was in the range of 42X. That's crazy.

Does anyone know how Deepseek does it yet?
(Summary from Reddit)

- fp8 instead of fp32 precision training = 75% less memory

- multi-token prediction to vastly speed up token output

- Mixture of Experts (MoE) so that inference only uses parts of the model not the - entire model (~37B active at a time, not the entire 671B), increases efficiency

- PTX (basically low-level assembly code) hacking in old Nvidia GPUs to pump out as much performance from their old H800 GPUs as possible

Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.

They also use some kind of factorized attention that somehow leads to compression of tokens (I still haven't read their papers, so I can't be clearer than this).
Honestly, I’m not sure I’m completely sold on the value of LLMs long term but this is the most realistic and reasonable take I’ve read on this post so far.

If anything, it’s an downward adjustment in the cost implications but could actually unlock exponential improvements on a shorter time horizon than expected because of that. Investors getting scared probably is a good opportunity to buy in.

Bullish on the use. Bearish on the profit margins for the big players.

If (big if!) I understand correctly, the ceiling for edge/local/offline AI has just blown off.

Is there an acronym for edge/local/offline? ELO could be confused with something AI already dominates at. As someone working in the edge/local/offline space it’s interesting to hear these together though. Offline is local but local often isn’t offline :)
Bullish on the prospects for small players, then.
It's always been possible to "do (worse) AI on less compute". We've had years of open models! I also don't understand how anyone can see this as anything but good news for OpenAI. The ultimate value proposition of AI has always depended on whether it stretches to AGI and beyond, and R1 demonstrates that there's several orders of magnitude of hardware overhang. This makes it easier for OpenAI to succeed, not harder, because it makes it less likely that they'll scale to their financial limits and still fail to surpass humans.
The point is that this was developed outside of OpenAI.

So the real question is why does anyone believe that OpenAI will bring AGI when actual innovation was happening in some hedge fund in China while OpenAI was going on an international tour trying to drum up a trillion dollars.

Okay, that argument makes no sense to me. I thought the whole point of VC is that money is cheaper than time to market? So OpenAI didn't microoptimize their training code, sure, but they didn't need to. All the innovation of R1 is that they managed to match OpenAI's tech demo from like a year ago using considerably worse hardware by microoptimizing the hell out of it. And that's cool, full credit to them, it's a mighty impressive model. But they did it like that because they had to. It's very impressive given their constraints, but it doesn't actually advance the field.
The interesting part is that distillations based on reinforcement learning based models are performing so well. That brings the cost down dramatically to do certain tasks.
I thought the distillations were SFT only?
They're SFT on the chain of thought output of R1