by wokwokwok 1145 days ago
There’s a lot of controversy about “7B is good enough and small enough for consumer hardware, so it’s good enough, full stop.”

…but, although it is true that for a fixed compute budget these small models can have impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.

It’s just way more expensive to train larger models.

They specifically note they are training a smaller 3B model in the future.

So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be footing the cost of training a larger model.

This is probably more about exploring the “can we make a cheap, good-enough model?” question than “here is your GPT4 replacement”.

4 comments

Agreed. With some work, 13B runs on consumer hardware at this point. That redefines consumer to a 3090 (but hey, some depressed crypto guys are selling them. I recently got another GPU for my homelab this way).

30B is within reach, with compression techniques that seem to lose very little of the information in the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important, assuming an appropriate activation function after this transformation.
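
For anyone wanting to sanity-check the "fits on consumer hardware" claims, here is a rough back-of-the-envelope sketch. It assumes memory is dominated by the weights and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the list of sizes are just illustrative choices, not anything from a specific library.

```python
# Weight-only memory estimate: parameters * bits per weight / 8 bytes.
# Assumption: KV cache, activations, and framework overhead are ignored,
# so treat these numbers as a lower bound.
GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# Illustrative model sizes (nominal parameter counts).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9), ("65B", 65e9)]:
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gib(n_params, bits):6.1f} GiB"
        for bits in (16, 8, 4)
    )
    print(f"{name:>4}  {row}")
```

By this estimate, 13B at 16 bits is roughly 24 GiB of weights alone, which is why it only fits a 24 GB card "with some work" (i.e. quantization), while 30B at 4 bits comes in around 14 GiB.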

No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.

> That redefines consumer to a 3090

Or a beefy MacBook Pro. I recently bought one with 64 GB of memory and Llama 65B infers very promptly as long as I'm using quantized weights (and the Mac's GPU).

This is very impressive. I think everyone should pay very close attention to what M1/M2 have given us.

But I’m waiting until my friends can afford it. Right now (which at this pace might mean I change my mind tonight)

…I am earnestly studying how to make this a thing anyone can install as a part of a product they can use without a subscription.

And beam size 1?
Do you know of any research that tries to take a large pre-trained model and make it smaller by cutting out the least activated neurons and training it a bit so as not to lose performance?
The entire field of ML distillation.
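
To make the question concrete, here is a minimal sketch of the "cut out the least activated neurons" step on a toy two-layer MLP. Strictly speaking this is activation-based structured pruning rather than distillation, and it leaves out the "train it a bit" recovery step entirely. Everything here (function names, layer sizes, the calibration set) is a hypothetical illustration in pure Python, not how any particular paper or library does it.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def hidden_activations(w1, b1, x):
    """ReLU hidden activations for one input x; w1 is [hidden][in]."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)]


def forward(w1, b1, w2, x):
    """Toy two-layer MLP without output bias; w2 is [out][hidden]."""
    h = hidden_activations(w1, b1, x)
    return [sum(w * hi for w, hi in zip(row, h)) for row in w2]


def prune_least_active(w1, b1, w2, calib, keep):
    """Keep the `keep` hidden neurons with the highest mean activation."""
    n_hidden = len(w1)
    mean_act = [0.0] * n_hidden
    for x in calib:
        for j, a in enumerate(hidden_activations(w1, b1, x)):
            mean_act[j] += a / len(calib)
    ranked = sorted(range(n_hidden), key=lambda j: mean_act[j], reverse=True)
    kept = sorted(ranked[:keep])
    # Drop the pruned neurons' rows in w1/b1 and matching columns in w2.
    return ([w1[j] for j in kept], [b1[j] for j in kept],
            [[row[j] for j in kept] for row in w2], kept)


random.seed(0)
n_in, n_hidden, n_out = 4, 8, 3
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.5] * n_hidden
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b1[2] = b1[5] = -100.0  # contrived: force two neurons to never fire

calib = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(32)]
pw1, pb1, pw2, kept = prune_least_active(w1, b1, w2, calib, keep=6)

x = calib[0]
diff = max(abs(a - b) for a, b in
           zip(forward(w1, b1, w2, x), forward(pw1, pb1, pw2, x)))
print("kept neurons:", kept)
print("max output change:", diff)
```

In this contrived case the two dropped neurons never fire, so the output is unchanged. In a real network the pruned neurons do contribute something, which is where the follow-up training (or distilling from the original model) comes in.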
> They specifically note they are training a smaller 3B model in the future.

They're kidding, right? There's no way that thing will be more useful than one of those flan models.

Given inference costs and the ability to run on devices, there's an argument to be made for training models that are smaller than Chinchilla-optimal though, especially if you can still eke out improved performance with longer training times.
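
A rough sketch of that trade-off, using two commonly quoted approximations: about 20 training tokens per parameter as the Chinchilla-optimal rule of thumb, and about 6·N·D FLOPs to train and 2·N FLOPs per generated token at inference. Both are approximations, and the 1T-token budget for the 3B model is a made-up number for illustration, not something announced by anyone in this thread.

```python
# Assumptions (rules of thumb, not exact constants):
#   Chinchilla-optimal tokens ~ 20 * parameters
#   training FLOPs ~ 6 * parameters * tokens
#   inference FLOPs per token ~ 2 * parameters
TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params


def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params


small, large = 3e9, 7e9
overtrained_tokens = 1e12  # hypothetical budget, well past the rule of thumb

print(f"3B rule-of-thumb tokens: {chinchilla_tokens(small):.2e}")
print(f"7B rule-of-thumb tokens: {chinchilla_tokens(large):.2e}")
print(f"3B trained on 1T tokens: {train_flops(small, overtrained_tokens):.2e} FLOPs")
print(f"7B at rule-of-thumb:     {train_flops(large, chinchilla_tokens(large)):.2e} FLOPs")
print(f"inference cost ratio 7B/3B: "
      f"{inference_flops_per_token(large) / inference_flops_per_token(small):.2f}")
```

The point is that the extra training compute is paid once, while the smaller model's lower per-token cost is paid back on every inference call. Whether quality holds up is an empirical question this arithmetic says nothing about.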