

	by hodgehog11 236 days ago
	There has always been pressure to do so, but there are fundamental bottlenecks in performance when it comes to model size. One possibility I can think of is a push toward training for exclusively search-based rewards so that the model isn't required to compress a large proportion of the internet into their weights. But this is likely to be much slower and come with initial performance costs that frontier model developers will not want to incur.

5 comments

jiggawatts 236 days ago

> exclusively search-based rewards so that the model isn't required to compress a large proportion of the internet into their weights.

That just gave me an idea! I wonder how useful (and for what) a model would be if it was trained using a two-phase approach:

1) Put the training data through an embedding model to create a giant vector index of the entire Internet.

2) Train a transformer LLM that, instead of relying only on its weights, can also do lookups against the index.

It's like an MoE where one (or more) of the experts is a fuzzy Google search.

The best thing is that adding up-to-date knowledge won’t require retraining the entire model!
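
A rough sketch of that two-phase idea, to make it concrete. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `VectorIndex` is a hypothetical in-memory index, and `answer` is a stub where a trained transformer would consume the retrieved text. It only shows the lookup path and why new knowledge can be added without retraining.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """Hypothetical in-memory index: phase 1 fills it, phase 2 queries it."""

    def __init__(self):
        self.entries = []  # list of (embedding, original text)

    def add(self, text):
        self.entries.append((embed(text), text))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def answer(query, index):
    """Stub for the model: a real LLM would condition on the retrieved text."""
    context = index.search(query, k=1)
    return context[0] if context else ""


# Phase 1: index the "training data".
index = VectorIndex()
index.add("the capital of france is paris")
index.add("transformers use attention layers")

# Phase 2: the model does a lookup instead of relying only on its weights.
print(answer("what is the capital of france", index))

# New knowledge is added to the index; the model itself is untouched.
index.add("the newest release is version 2")
print(answer("what is the newest release", index))
```

In a real system the index would be an approximate nearest-neighbour store and the retrieved passages would be fed into the transformer during both training and inference; the stub above skips all of that.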


Grosvenor 236 days ago

Yeah that was my unspoken assumption. The pressure here results in an entirely different approach or model architecture.

If OpenAI is spending $500B, then someone can get ahead by spending $1B in a way that improves the model by >0.2% ($1B being 0.2% of $500B).

I bet there's a group or three that could improve results a lot more than 0.2% with $1B.


parineum 236 days ago

> so that the model isn't required to compress a large proportion of the internet into their weights.

The knowledge compressed into an LLM is a byproduct of training, not a goal. Training on internet data teaches the model to talk at all. The knowledge and ability to speak are intertwined.


thisrobot 236 days ago

I wonder if this maintains the natural language capabilities, which are what make LLMs magic to me. There is probably some middle ground, but not having to know which expressions or idiomatic speech an LLM will understand is really powerful from a user experience point of view.


UncleOxidant 236 days ago

Or maybe models that are much more task-focused? Like models that are trained on just math & coding?


agoodusername63 236 days ago

Isn't that what the mixture-of-experts trick that all the big players use is? A bunch of smaller, tightly focused models.


irthomasthomas 235 days ago

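Not exactly. MoE uses a router to select a subset of experts (feed-forward sub-networks inside the model) per token, not separate standalone models. This makes inference faster, but every expert still has to be loaded, so it requires the same amount of RAM.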
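
A toy sketch of per-token top-k routing, to show why compute drops but memory does not. The names (`experts`, `router_weights`, `route`, `moe_layer`) and the numbers are made up for illustration, and each "expert" is just a scaling function rather than a real feed-forward network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical expert: scales its input (stand-in for a feed-forward block)."""
    return lambda x: [scale * v for v in x]


# All four experts are built up front and stay resident in memory.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router weights: one row per expert, one column per input dim.
router_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]


def route(token, k=2):
    """Score every expert for this token and keep the k highest-scoring ones."""
    logits = [sum(w * v for w, v in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return top, gates


def moe_layer(token, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    top, gates = route(token, k)
    out = [0.0] * len(token)
    for idx, gate in zip(top, gates):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, top


# Different tokens activate different experts, but none can be unloaded.
for token in ([2.0, 1.0], [-3.0, 0.5]):
    _, used = moe_layer(token)
    print(token, "-> experts", used)
```

Only `k` experts run per token, which is where the speedup comes from, but since the selection changes from token to token the full `experts` list has to stay loaded.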
