| HN Mirror



	by logicchains 1144 days ago
	It's not clear from the GitHub; are there any plans to eventually train the 30 or 65 billion weight LLaMA models? The 65B model seems comparable to GPT3.5 for many things, and can run fine on a beefy desktop just on CPU (CPU ram is much cheaper than GPU ram). It'd be amazing to have an open source version.


wokwokwok 1144 days ago

There’s a lot of controversy about “7B is good enough and small enough for consumer hardware so it’s good enough fullstop”

…but, although it is true that for a fixed compute budget these small models can have impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.

It’s just way more expensive to train larger models.

They specifically note they are training a smaller 3B model in the future.

So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be fielding the cost for training a larger model.

This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.

b33j0r 1144 days ago

Agreed. With some work, 13B runs on consumer hardware at this point. That redefines consumer to a 3090 (but hey, some depressed crypto guys are selling them. I recently got another GPU for my homelab this way).

30B is within reach, with compression techniques that seem to lose very little of the overall network's information. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important, assuming an appropriate activation function after this transformation.

No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.

mrtranscendence 1144 days ago

> That redefines consumer to a 3090

Or a beefy MacBook Pro. I recently bought one with 64gb of memory and Llama 65B infers very promptly as long as I'm using quantized weights (and the Mac's GPU).

b33j0r 1144 days ago

This is very impressive. I think everyone should pay very close attention to what M1/M2 have given us.

But I’m waiting until my friends can afford it. Right now (which in this pace might mean I change my mind tonight)

…I am earnestly studying how to make this a thing anyone can install as a part of a product they can use without a subscription.

aftbit 1144 days ago

And beam size 1?

scotty79 1144 days ago

Do you know of any research that tries to take a large pre-trained model and make it smaller by cutting out the least-activated neurons and then training it a bit so it doesn't lose performance?

sebzim4500 1144 days ago

https://arxiv.org/pdf/2301.00774.pdf

KRAKRISMOTT 1144 days ago

The entire field of ML distillation.

moffkalast 1144 days ago

> They specifically note they are training a smaller 3B model In the future.

They're kidding, right? There's no way that thing will be more useful than one of those Flan models.

ummonk 1144 days ago

Given inference costs and the ability to run on devices, there's an argument to be made for training models that are smaller than Chinchilla-optimal though, especially if you can still eke out improved performance with longer training times.

tarruda 1144 days ago

I ran the 30B and 65B Q4 on a laptop with 64 GB of RAM (8/16 CPU). It worked, but tokens/s was too low for it to be practically useful.

logicchains 1144 days ago

That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5 GHz cores and 256 GB of RAM I get around 3 tokens/sec, which is usable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs?

lhl 1144 days ago

Although there are multiple bottlenecks, my understanding (and why, at a certain point, throwing more threads at it doesn't work) is that inference for dense LLMs is largely limited by memory bandwidth. Most desktop computers will have dual-channel DDR4/DDR5 memory, which will be hard pressed to get >60GB/s. A last-gen Epyc/Threadripper Pro should have 8-channel DDR4-3200 memory support, which should get you a theoretical max of 204.8 GB/s (benchmarking ends up more around 150GB/s in AIDA64).

The latest Genoa has 12-channel DDR5-4800 support (and boosted AVX-512) and I'd imagine it should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).
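
For anyone who wants to redo the bandwidth arithmetic above, here is a rough sketch. It assumes a 64-bit (8-byte) bus per memory channel and that every weight is read once per generated token; both function names are made up for illustration. Real Q4 formats store a bit more than 4 bits per weight, and measured bandwidth is lower than the theoretical peak, so treat the tokens/s figure as an optimistic ceiling rather than a prediction.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes each channel moves `bus_bytes` bytes per transfer
    (8 bytes for a 64-bit channel).
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def max_tokens_per_s(bandwidth_gbs: float, params_billion: float, bits_per_weight: float) -> float:
    """Optimistic decode ceiling: all weights are streamed once per token."""
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gbs / model_gb


if __name__ == "__main__":
    epyc = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)    # 8-channel DDR4-3200
    genoa = peak_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)  # 12-channel DDR5-4800
    print(f"8ch DDR4-3200:  {epyc:.1f} GB/s")
    print(f"12ch DDR5-4800: {genoa:.1f} GB/s")
    # 65B parameters at an idealized 4 bits per weight
    print(f"65B @ 4-bit ceiling on 8ch DDR4-3200: {max_tokens_per_s(epyc, 65, 4):.1f} tokens/s")
```

The 8-channel figure reproduces the 204.8 GB/s quoted above, and the resulting ceiling for a 4-bit 65B model is in the same ballpark as the ~3 tokens/sec reported on the Epyc elsewhere in the thread.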

Ambix 1143 days ago

Yeah, it's really so bad on desktops.

With my LLaMA AVX implementation on 32-bit floats [0] there is no performance gain after 2 threads, so the remaining 14 available threads are of no use; there is no memory bandwidth left to load them with work :)

[0] https://github.com/gotzmann/llama.go

nullc 1142 days ago

To the extent that you're memory bandwidth limited you should be able to do multiple inferences at once --- latency stays high but getting multiple samplings can be extremely useful for many uses and can cover up somewhat for high latency.

aljungberg 1139 days ago

To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.
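
A toy model of the two points above, under loud assumptions: each decode step streams the full weights once (shared by the whole batch) plus one KV cache per sequence, and compute is ignored entirely. The function name and the 1 GB-per-sequence KV figure are purely illustrative, not measurements.

```python
def batched_tokens_per_s(batch: int, weight_gb: float, kv_gb_per_seq: float, bandwidth_gbs: float) -> float:
    """Total tokens/s across the batch in a bandwidth-only model.

    One decode step reads the weights once plus each sequence's KV cache,
    and produces one token per sequence.
    """
    step_seconds = (weight_gb + batch * kv_gb_per_seq) / bandwidth_gbs
    return batch / step_seconds


if __name__ == "__main__":
    WEIGHT_GB = 32.5        # 65B parameters at an idealized 4 bits per weight
    KV_GB_PER_SEQ = 1.0     # hypothetical value, chosen only for illustration
    BANDWIDTH_GBS = 204.8   # theoretical 8-channel DDR4-3200 peak
    for batch in (1, 8, 64):
        rate = batched_tokens_per_s(batch, WEIGHT_GB, KV_GB_PER_SEQ, BANDWIDTH_GBS)
        print(f"batch={batch:3d}: {rate:.1f} tokens/s total")
    print(f"asymptote: {BANDWIDTH_GBS / KV_GB_PER_SEQ:.1f} tokens/s")
```

In this model batching helps a lot at first because the weight traffic is amortized, but total throughput can never exceed bandwidth divided by the per-sequence KV traffic, which is the bottleneck described above.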

logicchains 1144 days ago

Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.

cjbprime 1144 days ago

The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB RAM in (two dual-rank 32GB DIMMs) then you lose around 50% of the bandwidth compared to a single rank DIMM (e.g. from 8GHz to 4GHz speeds, and increased latency).

azeirah 1144 days ago

I'm following the discussions on GitHub as well as their PRs closely.

The primary bottleneck for now is compute.

They've recently made a big improvement to performance by introducing partial GPU acceleration if you compile with a GPU-accelerated variant of BLAS: either cuBLAS (Nvidia) or CLBlast (slightly slower but supports almost everything: Nvidia, Apple, AMD, mobile, Raspberry Pi etc).

tarruda 1144 days ago

3 tokens/sec is a lot faster than what I experienced. Even though your CPU has a lot more cores, I think llama.cpp was not able to make good use of more than 8 threads.

When did you test this? Maybe llama.cpp had some improvements since I used it (which was at the start of the project).

Ambix 1143 days ago

It's not about the number of threads, it's about the memory bottleneck. The sweet spot for my M1 Pro laptop is around 6 threads and a 4-bit model - I've managed to get 20 tokens per sec, really impressive

logicchains 1144 days ago

I tested this on the latest master. Llama.cpp has had some performance improvements, although I don't know if that'd be enough to make it 3x faster.

mrtranscendence 1144 days ago

That's just a bit faster than my MacBook Pro, for what it's worth. Which was quite expensive but I don't think AMD Epyc expensive ...

Ambix 1143 days ago

Is it Zen1 architecture? It should be much better on Zen2 and newer Epycs

simion314 1144 days ago

Slow could still be useful if you do not want to chat with it and instead code it to do a long-running job, like reviewing your entire project the way a code analysis tool would, or summarizing a lot of content.

bagels 1144 days ago

How low? I think everybody has different requirements there.

extasia 1144 days ago

I ran it on a modern desktop and was getting sub 1 token/s

asah 1144 days ago

Could it parallelize across multiple PCs?

serialx 1144 days ago

No, since it’s stateful in the sense that inference is dependent on the previously generated tokens.

GistNoesis 1144 days ago

That's why it's not parallelized along the time axis but rather along the dimension of the embedding axis.

You split the big matrices into smaller matrices to dispatch the workload. But this means you have to add some communication overhead (roughly nblayers sequential synchronisation points per token). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear, ParallelEmbedding, see https://github.com/facebookresearch/llama/blob/main/llama/mo...

Transformers have multiple attention heads that can be computed independently and then summed together to produce the output of the layer. This allows splitting the parameter space among machines without having to transfer the parameters at each iteration.
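
A minimal pure-Python sketch of the matrix splitting described above. The function names are hypothetical and this is not the API of RowParallelLinear or ColumnParallelLinear; it only shows that sharding a weight matrix by columns (concatenate the outputs) or by rows (sum the partial outputs) gives the same result as the unsplit product. It assumes the split dimension divides evenly by the number of workers, and the loop over workers stands in for separate machines.

```python
def matvec(x, W):
    """y = x @ W, with W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def column_parallel(x, W, workers):
    """Each worker holds a slice of W's columns; outputs are concatenated."""
    step = len(W[0]) // workers
    out = []
    for w in range(workers):
        shard = [row[w * step:(w + 1) * step] for row in W]
        out.extend(matvec(x, shard))  # every worker needs the full input x
    return out


def row_parallel(x, W, workers):
    """Each worker holds a slice of W's rows and the matching slice of x.

    The partial outputs must be summed across workers, which is the
    synchronisation point (an all-reduce in a real multi-device setup).
    """
    step = len(W) // workers
    total = [0] * len(W[0])
    for w in range(workers):
        lo, hi = w * step, (w + 1) * step
        partial = matvec(x[lo:hi], W[lo:hi])
        total = [a + b for a, b in zip(total, partial)]
    return total


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    W = [[1, 0, 2, 1],
         [0, 1, 1, 3],
         [2, 1, 0, 0],
         [1, 1, 1, 1]]
    print(matvec(x, W))
    print(column_parallel(x, W, workers=2))
    print(row_parallel(x, W, workers=2))
```

All three calls print the same vector. The weights never move between workers; only the small activation vectors are exchanged.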

brutus1213 1144 days ago

I'm really curious how Meta, DeepMind and OpenAI make the big models work. The biggest A100 you can buy is just 80GB. And I assume the big companies use single precision floating point during training. Are they actually partitioning the big model across multiple GPU instances? If one had the hardware, how many GPUs does the biggest LLaMA take? These are systems issues and I have not read papers or blog posts on how this works. To me, this infra is very non-trivial.
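
A back-of-the-envelope sketch for the "how many GPUs" question, counting the weights only. The helper names are made up for illustration. Training also needs memory for gradients, optimizer state and activations, so the real number is higher; treat this as a lower bound, not a description of how any of those labs actually train.

```python
import math


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, bytes_per_param: float, gpu_gb: float = 80) -> int:
    """Smallest number of GPUs whose combined memory fits the weights alone."""
    return math.ceil(weight_gb(params_billion, bytes_per_param) / gpu_gb)


if __name__ == "__main__":
    for label, bytes_per_param in (("fp32", 4), ("fp16", 2), ("4-bit", 0.5)):
        gb = weight_gb(65, bytes_per_param)
        gpus = min_gpus_for_weights(65, bytes_per_param)
        print(f"65B {label}: {gb:.1f} GB of weights, at least {gpus} x 80GB GPU(s)")
```

The same arithmetic shows why a 4-bit 65B model (about 32.5 GB of weights) fits on a machine with 64 GB of memory, as reported elsewhere in the thread.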

nullc 1144 days ago

Depends on your application; if getting many completions is useful to you then it's embarrassingly parallel.

tarruda 1144 days ago

I didn't measure, but IIRC it was lower than 1 token/sec

quickthrower2 1144 days ago

If I rent an A100 what kind of speed could I expect?

GC_tris 1144 days ago

While I do not have any A100 handy right now I have an instance running on Genesis Cloud with 4x RTX 3090.

A quick, very unscientific test using oobabooga/text-generation-webui with some models I tried earlier gives me:

* oasst-sft-7-llama-30b (spread over 4x GPU): Output generated in 28.26 seconds (5.77 tokens/s, 163 tokens, context 55, seed 1589698825)

* llama-30b-4bit-128g (only using 1 GPU as it is so small): Output generated in 12.88 seconds (6.29 tokens/s, 81 tokens, context 308, seed 1374806153)

* llama-65b-4bit-128g (only using 2 GPU): Output generated in 33.36 seconds (3.81 tokens/s, 127 tokens, context 94, seed 512503086)

* llama (vanilla, using 4x GPU): Output generated in 5.75 seconds (4.69 tokens/s, 27 tokens, context 160, seed 1561420693)

They all feel fast enough for interactive use. If you do not have an interface that streams the output (so you can see it progressing) it might feel a bit weird if you often have to wait ~30s to get the whole output chunk.

newswasboring 1144 days ago

At least for now they are focused on 7B and then 3B[1].

[1] https://github.com/openlm-research/open_llama#future-plans

Silverback_VII 1144 days ago

I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.

cubefox 1144 days ago

The Chinchilla scaling law gives both the optimal training data size and the optimal number of parameters for a given training compute budget. See

https://dynomight.net/scaling/

sp332 1144 days ago

For training, yes, but these models are optimized for inference, since inference will be run many more times than training. The original LLaMA models were trained way past Chinchilla-optimal amounts of data.
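
To put a rough number on "way past", here is a sketch using two common rules of thumb that are assumptions here, not statements from the thread: about 20 training tokens per parameter for Chinchilla-optimal, and training compute of about 6 x parameters x tokens. The 1T-token figure for LLaMA 7B is from the LLaMA paper; the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # assumed rule-of-thumb approximation of Chinchilla-optimal


def chinchilla_tokens(params: int) -> int:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * params


def training_flops(params: int, tokens: int) -> int:
    """Common approximation: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def overtraining_ratio(params: int, tokens: int) -> float:
    """How many times more tokens were used than the rule of thumb suggests."""
    return tokens / chinchilla_tokens(params)


if __name__ == "__main__":
    params = 7_000_000_000       # LLaMA 7B
    tokens = 1_000_000_000_000   # 1T training tokens
    print(f"rule-of-thumb optimum: {chinchilla_tokens(params) / 1e9:.0f}B tokens")
    print(f"actual / optimum:      {overtraining_ratio(params, tokens):.1f}x")
    print(f"approx training FLOPs: {training_flops(params, tokens):.2e}")
```

Under these assumptions the 7B model saw roughly seven times the compute-optimal amount of data, which fits the trade described above: spend more on training to get a smaller model that is cheaper to run.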