OpenLLaMA: An Open Reproduction of LLaMA (github.com)
484 points by sadiq 1144 days ago
18 comments

To use with llama.cpp on CPU and 8GB RAM

  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
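For anyone wondering what the quantize step does conceptually, here is a minimal sketch of symmetric round-to-nearest quantisation. It is only an illustration: it is not llama.cpp's actual q5_0 format, and the function names `quantize` and `dequantize` are made up for this example.

```python
def quantize(weights, bits=8):
    """Map floats to small signed integers plus one shared scale factor.

    Illustrative only; real formats store a scale per small block of weights.
    Expects a non-empty list of floats.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 15 for 5 bits, 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize(original, bits=5)
    restored = dequantize(q, scale)
    for w, r in zip(original, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.4f})")
```

Each restored weight is off by at most half the scale, which is the sense in which the low-precision model "still works" while taking far less memory.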
You the real MVP!

Though I'm getting this error on an Intel macbook (Monterey); it works fine on a Windows11 box:

   python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
   Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
   Traceback (most recent call last):
    File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
      convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
    File "/l/llama.cpp/convert.py", line 1129, in main
       model_plus = load_some_model(args.model)
     File "/l/llama.cpp/convert.py", line 1055, in load_some_model
       models_plus.append(lazy_load_file(path))
     File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
       raise ValueError(f"unknown format: {path}")
   ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
I had the same issue and then noticed that I need git lfs - otherwise just cloning the repo will not download the weights.
After getting the model with git lfs I get:

Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin

Traceback (most recent call last):

  File "convert-pth-to-ggml.py", line 11, in <module>
    convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1145, in main
    model_plus = load_some_model(args.model)
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1071, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 865, in lazy_load_file
    return lazy_load_torch_file(fp, path)
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 737, in lazy_load_torch_file
    model = unpickler.load()
TypeError: 'staticmethod' object is not callable
Thanks for the tip! After running `brew install git-lfs && git lfs install` on my Macbook, I was able to run the model.
I get the same error on an M series MacBook (Ventura). However from the repo README.md it looks like make should work instead of cmake, I’ll give that a try.
It's not clear from the GitHub; are there any plans to eventually train the 30 or 65 billion weight LLaMA models? The 65B model seems comparable to GPT3.5 for many things, and can run fine on a beefy desktop just on CPU (CPU ram is much cheaper than GPU ram). It'd be amazing to have an open source version.
There’s a lot of controversy about “7B is good enough and small enough for consumer hardware so it’s good enough fullstop”

…but, although it is true that for a fixed compute budget these small models can have impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.

It’s just way more expensive to train larger models.

They specifically note they are training a smaller 3B model in the future.

So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be footing the bill for training a larger model.

This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.

Agreed. With some work, 13B runs on consumer hardware at this point. That redefines consumer to a 3090 (but hey, some depressed crypto guys are selling them. I recently got another GPU for my homelab this way).

30B is within reach, with compression techniques that seem to lose very little information of the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important. Assuming an appropriate activation function after this transformation.

No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.

> That redefines consumer to a 3090

Or a beefy MacBook Pro. I recently bought one with 64gb of memory and Llama 65B infers very promptly as long as I'm using quantized weights (and the Mac's GPU).

This is very impressive. I think everyone should pay very close attention to what M1/M2 have given us.

But I’m waiting until my friends can afford it. Right now (which in this pace might mean I change my mind tonight)

…I am earnestly studying how to make this a thing anyone can install as a part of a product they can use without a subscription.

And beam size 1?
Do you know of any research that tries to take a large pre-trained model and make it smaller by cutting out the least activated neurons and training it a bit so it doesn't lose performance?
The entire field of ML distillation.
> They specifically note they are training a smaller 3B model In the future.

They're kidding right, there's no way that thing will be more useful than one of those flan models.

Given inference costs and the ability to run on devices, there's an argument to be made for training models that are smaller than Chinchilla-optimal though, especially if you can still eke out improved performance with longer training times.
I ran the 30b and 65b Q4 on a laptop with 64 gb of RAM (8/16 CPU). It worked, but token/s was too low for it to be practically useful.
That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5ghz cores and 256 GB of ram I get around 3 tokens/sec, which is useable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs?
Although there are multiple bottlenecks, my understanding (and the reason that, at a certain point, throwing more threads at it doesn't help) is that inference for dense LLMs is largely limited by memory bandwidth. Most desktop computers will have dual channel DDR4/DDR5 memory, which will be hard pressed to get >60GB/s. A last-gen Epyc/Threadripper Pro should have 8 channel DDR4-3200 memory support, which should get you a theoretical max of 204.8 GB/s (benchmarking ends up more around 150GB/s in AIDA64).

The latest Genoa has 12 channel DDR5-4800 support (and boosted AVX-512) and I'd imagine should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).
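The theoretical figure above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a benchmark. It assumes a 64-bit (8-byte) bus per DDR channel, and it assumes that generating one token has to read every weight once, so tokens/s is bounded by bandwidth divided by model size. The function names and the 4-bit 65B example are illustrative assumptions.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (assumes 8 bytes per transfer per channel)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def model_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, ignoring per-block scales and the KV cache."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_s(bandwidth_gbs, size_gb):
    """Upper bound if every token requires streaming all weights from memory once."""
    return bandwidth_gbs / size_gb


if __name__ == "__main__":
    size = model_size_gb(65, 4)  # 65B parameters at 4 bits per weight
    for name, channels, speed in [("dual-channel DDR4-3200", 2, 3200),
                                  ("8-channel DDR4-3200", 8, 3200)]:
        bw = peak_bandwidth_gbs(channels, speed)
        print(f"{name}: {bw:.1f} GB/s, at most {max_tokens_per_s(bw, size):.1f} tokens/s")
```

With 8 channels of DDR4-3200 this gives 204.8 GB/s, and a ceiling of roughly 6 tokens/s for a 4-bit 65B model. Real throughput will be lower, since measured bandwidth is below the theoretical peak.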

Yeah, it's really so bad on desktops.

With my LLaMA AVX implementation on 32bit floats [0] there is no performance gain after 2 threads, so the remaining 14 available threads are of no use; there is no memory bandwidth left to load them with work :)

[0] https://github.com/gotzmann/llama.go

To the extent that you're memory bandwidth limited you should be able to do multiple inferences at once --- latency stays high but getting multiple samplings can be extremely useful for many uses and can cover up somewhat for high latency.
To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.
Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.
The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB RAM in (two dual-rank 32GB DIMMs) then you lose around 50% of the bandwidth compared to a single rank DIMM (e.g. from 8GHz to 4GHz speeds, and increased latency).
I'm following the discussions on GitHub as well as their PRs closely.

The primary bottleneck for now is compute.

They've recently made a big improvement to performance by introducing partial gpu acceleration if you compile with a gpu accelerated variant of BLAS. Either cublas (Nvidia) or CLBlast (slightly slower but supports almost everything: Nvidia, Apple, AMD, mobile, raspberry pi etc)

3 tokens/sec is a lot faster than what I experienced. Even though your CPU has a lot more cores, I think llama.cpp was not able to make good use of more than 8 threads.

When did you test this? Maybe llama.cpp had some improvements since I used it (which was at the start of the project).

It's not about the number of threads, it's about the memory bottleneck. The sweet spot for my M1 Pro laptop is around 6 threads and a 4bit model - I've managed to get 20 tokens per sec, really impressive
I tested this on the latest master. Llama.cpp has had some performance improvements, although I don't know if that'd be enough to make it 3x faster.
That's just a bit faster than my MacBook Pro, for what it's worth. Which was quite expensive but I don't think AMD Epyc expensive ...
Is it Zen1 architecture? It should be much better on Zen2 and newer Epycs
Slow could still be useful if you do not want to chat with it and instead code it to do a long-running job, like reviewing your entire project the way a code analysis tool would, or summarizing a lot of content.
How low? I think everybody has different requirements there.
I ran it on a modern desktop and was getting sub 1 token/s
could it parallelize across multiple PCs ?
No since it’s stateful in the sense that inferencing is dependent on the past generated tokens.
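A toy sketch of why a single generation is sequential: each new token is computed from all the tokens before it, so step N cannot start until step N-1 has finished. `next_token` here is a hypothetical stand-in for a real model's forward pass, using a tiny lookup table.

```python
def next_token(context):
    """Hypothetical stand-in for a model forward pass: picks the next token from the last one."""
    table = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}
    return table.get(context[-1], "</s>")


def generate(prompt, max_new_tokens):
    """Autoregressive loop: every step consumes the output of the previous step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # depends on everything generated so far
        if token == "</s>":
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(generate(["<s>"], max_new_tokens=10))
```

Separate prompts do not depend on each other, so they could be handed to different machines, but that does not make any single response faster.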
I didn't measure, but IIRC it was lower than 1 token/sec
If I rent an A100 what kind of speed could I expect?
While I do not have any A100 handy right now I have an instance running on Genesis Cloud with 4x RTX 3090.

A quick, very unscientific test using oobabooga/text-generation-webui with some models I tried earlier gives me:

* oasst-sft-7-llama-30b (spread over 4x GPU): Output generated in 28.26 seconds (5.77 tokens/s, 163 tokens, context 55, seed 1589698825)

* llama-30b-4bit-128g (only using 1 GPU as it is so small): Output generated in 12.88 seconds (6.29 tokens/s, 81 tokens, context 308, seed 1374806153)

* llama-65b-4bit-128g (only using 2 GPU): Output generated in 33.36 seconds (3.81 tokens/s, 127 tokens, context 94, seed 512503086)

* llama (vanilla, using 4x GPU): Output generated in 5.75 seconds (4.69 tokens/s, 27 tokens, context 160, seed 1561420693)

They all feel fast enough for interactive use. If you do not have an interface that streams the output (so you can see it progressing) it might feel a bit weird if you often have to wait ~30s to get the whole output chunk.

At least for now they are focused on 7B and then 3B[1].

[1]https://github.com/openlm-research/open_llama#future-plans

I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.
The Chinchilla scaling law describes, apart from the training data size, the optimal number of parameters for a given amount of computing power for training. See

https://dynomight.net/scaling/
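As a back-of-the-envelope sketch of that trade-off, two commonly quoted rules of thumb can be combined: training compute is roughly 6 FLOPs per parameter per token, and the compute-optimal token count is roughly 20 tokens per parameter. Both constants are approximations assumed here for illustration, not exact values from the linked article, and the function names are made up.

```python
import math


def training_flops(params, tokens):
    """Rule of thumb: C = 6 * N * D FLOPs."""
    return 6.0 * params * tokens


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


if __name__ == "__main__":
    budget = training_flops(7e9, 1e12)  # a 7B model trained on 1 trillion tokens
    n, d = chinchilla_optimal(budget)
    print(f"tokens per parameter actually used: {1e12 / 7e9:.0f}")
    print(f"compute-optimal for the same budget: {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Under these assumptions, a 7B model trained on 1 trillion tokens sees about 143 tokens per parameter, well beyond the roughly 20 the rule suggests. That means the same training compute would have been "optimally" spent on a larger model with fewer tokens.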

For training, yes, but these models are optimized for inference, since inference will be run many more times than training. The original Llama models were run way past chinchilla-optimal amounts of data.
Does anyone have any resources they recommend for just understanding the base terminology of models like this? I always see the terms "weights", "tokens", "model", etc. I feel like I understand what these mean, but I have no idea what I need to care about them for in open models like this? If I were to download an open model to run on my machine, would I download the weights? I'm just ignorant in the ML space I guess but not sure where to start.
Psst ... why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this? Get those personalised explanations and enjoy its unlimited patience.

I have felt the same in the past, about a completely different topic. I know how it feels; it's like people are not calling things what they are, just using weird words.

"weights" - synapses in the AI brain

"tokens" - word fragments

"model" - of course, the model is the AI brain

"context" - the model can only handle a piece of text, can't put whole books in, so this limited window is the context

"GPT" - predicts the next word, trained on everything; if you feed its last predicted word back in, it can write long texts

"LoRA" - a lightweight plug-in model for tweaking the big model

"loss" - a score telling how bad is the output

"training" - change the model until it fits the data

"quantisation" - making a low precision version of the model because it still works, but now is much faster and needs less compute

"embedding" - just a vector, it stands for the meaning of a word token or a piece of image; these embeddings are learned

But isn't this a bad idea when you don't even know the basics? You wouldn't be able to separate genuine information from subtle or not-so-subtle hallucinations.

It's like generating code in a language that you know nothing about. You should check for bugs, but you can't.

The first thing to learn is you can’t trust the internet. From that you’ll know not to trust gpt. If you are prone to trusting things blindly, without doing your own research or verification, you have far bigger problems than gpt “hallucinations” (frankly a terrible terminology).
I find "hallucinations" to be pretty apt. What works better in your opinion?
The neurological term for it is "Confabulation", which is a lot better than "Hallucination" as used in AI.

Confabulation is the unintended generation of false memories.

Hallucination is false perception.

Clearly, the phenomenon that LLM researchers call Hallucination better fits Confabulation.

Lies. Bullshit. Con artistry.

It's not perceiving reality incorrectly, it's presenting wholesale fiction as fact both coherently and with absolute confidence. It even forges supporting documentation ad-hoc.

GPT is not a poor schizophrenic suffering from delusions or innocuous "hallucinations." It is the world's most advanced liar.

In my opinion people are way more afraid of hallucinations than they should be. You are not asking it to solve world hunger, this is basically like asking it to summarize Wikipedia articles. At least with GPT4 it doesn't hallucinate on basic things. I am learning typescript with it, and it hasn't given me wrong answers to direct questions yet. If you are too worried about hallucinations use something like phind.com which will give some sources.
Anyone can evaluate whether it's giving you a self-consistent set of statements, and the additional words it spits out are helpful for a traditional search for alternative sources.

IMO, so long as you're aware the information is often subtly wrong, it's not that different from, e.g., physics classes progressively lying to you less to allow your brain to build a framework to house the incoming ideas.

I think one of the good things to get a sense of with ChatGPT is the types of areas where it is most and least likely to confabulate. If I asked it for an ELI5 about key concepts relating to how LLMs work, I would be highly confident it would be accurate. When you start asking about truly esoteric topics, that's when it often starts completely making things up.
I like the term "confabulation". A hallucination is an artifact of an intoxicated or malfunctioning brain. In my experience, confabulation is a common occurrence in normal brains, and can occur without intention. It's why humans make such poor witnesses. It's how the brain fills in the blanks in its senses and experience.
> Psst ... why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this?

I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly.

That attitude is going to cost you. You'll have no choice but to abandon it at some point, as the LLM implementations get better. The improvements in GPT4 over 3.5 alone are enough to dispel a lot of my own initial skepticism.
> That attitude is going to cost you.

I don’t think it will cost me much to not use the explicitly-not-a-search-engine thing as a search engine.

Which LLM will you use to verify that ChatGPT is more knowledgeable than human experts on a given topic?

The thing is, your mistake isn't just distrusting the language model, it's trusting the search engine. No matter what tool you use, the responsibility for ensuring accuracy is ultimately yours. Similar degrees of caution and skepticism must be applied to results from both ML and traditional search engines.

They are both insanely powerful tools, and like most insanely powerful tools, the hazards are considerable.

These are explanations that make sense to people who already know how deep learning works but don't really explain much to beginners beyond giving them a grossly oversimplified misrepresentation of what is being discussed (while not actually explaining anything).

My advice to folks is, if you actually want to know how this stuff works at some basic level, put in some time learning how basic linear and logistic regression work, including how to train it using back propagation. From there you'll have a solid foundation that gives enough context to understand most deep learning concepts at a high level.

It was intended as a demystification, not a total explanation. There are millions of places explaining with technical details.
> why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this?

when it can hallucinate content, why do that instead of reading a blog post from an expert?

Oh no, it will hallucinate an obscure fact, but not basics. It's pretty good at reciting theory, it would pass many ML engineering theoretical interviews.

If you don't trust its memory, copy a piece of high quality text in the topic of interest inside the context, as reference.

it's repeatedly made up entire quotes and research papers?
Not the OP, but I'm still hesitant because it infuriates me that I have to give them my identity, which they will then log every prompt against. You think they aren't building profiles on people? AI Moties (a Mote in God's Eye reference) is what they are.
I think this is the right answer, ChatGPT is an excellent 1-1 tutor.
Andrej Karpathy's Zero to Hero video series [1] is a good middle ground. It isn't super low-level but it also isn't super high-level. I think seeing how the pieces actually fit together in a working project is valuable to get a real understanding.

After going through this series I can say I basically understand weights, tokens, back-propagation, layers, embeddings, etc.

1. https://karpathy.ai/zero-to-hero.html

I'm working my way through that series now. He really is a good teacher -- I keep waiting for the inevitable "Next, draw the rest of the fucking owl" moment, but so far he does seem to be sticking to his commitment to a from-scratch approach.
When was this published? Is this an older tutorial by Karpathy?

Just curious, didn't see any date...

The first class is 8 months old and the latest one is 3 months old. If you click on the links, they'll direct you to YouTube videos.
On youtube you can. First video 8 months ago.
Weights are basically number/float variables. In neural networks, vectors of values are multiplied (or math'd in some way) by weights to get new vectors of values. A 500 billion weight model has 500 billion variables, all carefully chosen via training.

A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.

Tokens are sort of "words" in a sentence, but the ML may be translating the word itself into a more abstract concept in 'word space': eg, a bunch of floating point values.

At least some of what I just said is probably wrong, but now someone will correct me and we'll both be more right!
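To make "vectors of values are multiplied by weights" concrete, here is a minimal sketch of one fully connected layer in plain Python. The numbers are arbitrary and the helper names are made up; a real model stacks many such layers (plus other components) with billions of trained weights.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs, then ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU: negative sums become zero
    return outputs


def count_weights(weights, biases):
    """Number of trainable values in the layer."""
    return sum(len(row) for row in weights) + len(biases)


if __name__ == "__main__":
    weights = [[0.5, -1.0],   # one row of weights per output value
               [2.0, 1.0]]
    biases = [0.0, -1.0]
    print(dense([1.0, 2.0], weights, biases))
    print(count_weights(weights, biases), "trainable values")
```

Here the "model" is the code in `dense` (the architecture) together with the values in `weights` and `biases`. Downloading an open model mostly means downloading those values.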

At a first approximation this is pretty good. I wouldn't say this exactly:

> A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.

Because data doesn't really flow through weight matrices, though perhaps this is true if you squint at very simple models. Deep learning architectures are generally more complicated than multiplying values by weights and pushing the results to the next layer, though which architecture to use depends heavily on context.

> Tokens are sort of "words" in a sentence

Tokens are funny. What a token is depends on the context of the model you're using, but generally a token is a portion of a word. (Why? Efficiency is one reason; handling unknown words is another.)

> What a token is depends on the context of the model you're using, but generally a token is a portion of a word.

When doing quick estimates, I just assume every syllable is a token. It tends to overestimate, which is fine for my OOM mitigation purposes.
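A crude sketch of that kind of estimate, counting each run of vowels as one syllable and each syllable as one token. This is a hypothetical heuristic for budgeting only: real tokenizers split text differently, and vowel-run counting is itself an approximation of syllables.

```python
import re


def estimate_tokens(text):
    """Rough token estimate: one per vowel group, and at least one per word or number."""
    total = 0
    for word in re.findall(r"\w+", text):
        vowel_groups = re.findall(r"[aeiouy]+", word.lower())
        total += max(1, len(vowel_groups))
    return total


if __name__ == "__main__":
    prompt = "Building a website can be done in 10 simple steps"
    print(estimate_tokens(prompt))
```

For the prompt above this returns 15 for a 10-word sentence, so it leans high, which is the safe direction when the goal is to avoid running out of memory or context.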

Probably not the answer you would like but I think your approach to download them and figure out how to run them on your machine is a good one. You don't need to understand everything to get something working. It can be overwhelming and unproductive to know everything before getting started.

To learn more deeply though, get started with getting it to work and when you are curious or something doesn't work, try to understand why and recursively go back to fill in the foundational details.

For example, download the code and try to get it to work. Why is it not working? Oh, it's trying to look for the model. Search for how to get the model and set it up. Then the key step: recursively look up every single thing in the guide or setup. Don't try to set something up or fix something without truly understanding what it is you are doing (e.g. by copying and pasting). This gives you a structured way to fill in the foundations of what it is you are trying to get to work in a more focused and productive manner. At the end you might realize that their approach or yours is not optimal: "oh, it was telling me to download the 65B model when I can only run 7B on my machine bc ..."

For a good general non-technical introduction I recommend the YouTube computerphile series related to language models, transformers and other general concepts. If you are interested in actually doing stuff there’s an over abundance of material out there already, if you try looking.
I haven't watched it yet, but the Practical Deep Learning for Coders course that's available on YouTube is often recommended

https://course.fast.ai/

A book about AI. (Norvig and Russell comes to mind)
I'm always curious about the cost of these training runs. Some back of the envelope calculations:

> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run

1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.

Assuming "on-demand" pricing [1] that's about $500,000 training cost.

[1] https://cloud.google.com/tpu/pricing
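The same estimate as a small script, so the inputs can be varied. The throughput is the figure quoted from the README. The price per chip-hour is an assumed placeholder back-derived from the numbers in this comment, not a value taken from the pricing page.

```python
def chip_hours(total_tokens, tokens_per_chip_second):
    """Chip-hours needed to process total_tokens at the given per-chip throughput."""
    return total_tokens / tokens_per_chip_second / 3600


def cost_usd(hours, usd_per_chip_hour):
    """Total cost at a flat hourly price per chip."""
    return hours * usd_per_chip_hour


if __name__ == "__main__":
    ASSUMED_USD_PER_CHIP_HOUR = 3.30  # placeholder, check current pricing
    for label, tokens in [("1 trillion tokens", 1e12), ("200 billion tokens", 200e9)]:
        hours = chip_hours(tokens, 1900)
        dollars = cost_usd(hours, ASSUMED_USD_PER_CHIP_HOUR)
        print(f"{label}: {hours:,.0f} chip-hours, about ${dollars:,.0f}")
```

The unrounded figure for 1 trillion tokens is about 146,000 chip-hours, so the 150,000 above is a slight round-up. For the 200 billion tokens trained so far, it comes to roughly 29,000 chip-hours.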

At these levels of spending the actual cost is heavily negotiated and is usually far below the advertised on-demand pricing.

Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.

Nobody in their right mind is using GCE for training. Take a look at real prices: https://vast.ai/
I got the impression that kind of thing (buying time on GPUs hosted in people's homes) isn't useful for training large models, because model training requires extremely high bandwidth connections between the GPUs such that you effectively need them in the same rack.
I suspect most A100s on vast.ai are actually in a datacenter, and might even be on other public clouds, such as AWS. I don't see why either vast.ai or AWS would care if this were the case.
Is there a good resource that describes the impact of bandwidth and latency between GPUs?

I assume that it's completely impractical to train on distributed systems?

Anyone training this size of model is almost certainly using AWS/GCE.

The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.

Well, or Azure.
Ha yes of course. But actually has anyone been able to get instances on Azure? Thought OpenAI had them all reserved.
Aren't they explicitly using TPUs in their training? Vast AI are only offering GPUs.
These nodes typically have slow downstream, and thus are hard to use when training requires pulling a huge dataset.
Only 19 GPUs with 30+G of VRAM in the entire North America.

I might be misreading it. It might be just 12 GPUs.

They haven't trained a 1 trillion token model yet. They have only done 200bn so far
Google generously gives TPU time for free for research, so they are likely using that. A more representative number is the one from Meta, which required 87k A100 hours, or close to $100-200k, for 7B model training.
I am quite new to this, I would like to get it running. Would the process roughly be:

1. Get a machine with decent GPU, probably rent cloud GPU.

2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...

3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.

4. Install EasyLM:

    conda env create -f scripts/gpu_environment.yml
    conda activate EasyLM
5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:

    python -m EasyLM.models.llama.llama_serve \
         --mesh_dim='1,1,-1' \
         --load_llama_config='13B' \
         --load_checkpoint='params::path/to/easylm/llama/checkpoint' \
Am I even close?
I think llama.cpp might be easier to set up and get running.

https://github.com/ggerganov/llama.cpp

I second this recommendation to start with llama.cpp. It can run on a regular laptop and it gives a sense of what's possible.

If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer grade gaming hardware.

The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat

The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui

I don't see any indication that OpenLLaMa will run on either of those without modification. But one of those, or some other framework may emerge as a de-facto standard for running these models.

Yes, I can clone this and get into a prompt in less than 5 minutes on an M2 MBA.
might try it first. seems to be only CPU?
It has partial gpu acceleration if you compile it with LLAMA_CUBLAS or LLAMA_CLBLAST

They really have come a long way since... A few weeks ago.

Using cublas with my 1080ti results in a 52% speedup compared to cpu-only. Vram usage is very minimal.

I'd see that as a benefit of llama.cpp - it's specifically designed to be usable on consumer hardware such as laptops, without professional GPUs.
You can get it running with one Python script on Modal.com :)

https://github.com/modal-labs/modal-examples/blob/main/06_gp...

Ok you lot! Will try out modal.
Yeah it is pretty nice. Not sure how long it took, but less than the time to make a sandwich (2 minutes). It cost 2-3c a pop, so sadly more expensive than GPT3.5. However, maybe it can be optimised. Or maybe there is some init cost that could be stored in state.

    (modal) fme:/mnt/c/temp/modal$ modal run openllama.py
    Initialized. View app at https://modal.com/apps/ap-9...
    Created objects.
    +-- Created download_models.
    +-- Created mount /mnt/c/temp/modal/openllama.py
    +-- Created OpenLlamaModel.generate.
    +-- Created mount /mnt/c/temp/modal/openllama.py
    Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 1733.54it/s]
    Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.23s/it]
    Building a website can be done in 10 simple steps:
    1. Choose a domain name. 2. Choose a web hosting service. 3. Choose a web hosting package. 4. Choose a web hosting plan. 5. Choose a web hosting package. 6. Choose a web hosting plan. 7. Choose a web hosting package. 8. Choose a web hosting plan. 9. Choose a web hosting package. 10. Choose a web hosting plan. 11. Choose a web hosting package. 12. Choose a web hosting package. 13. Choose a web hosting package. 14. Choose a web hosting
    App completed.
Thanks for trying it out!

2-3c per run seems very high. That's probably just the cost if you have to spin up a new container. You can shorten the idle timeout on a container if it's typically going to serve just one request. If it's going to serve more requests, then the startup and idle shutdown cost is amortized over more requests :)

I found this was the cost per call to a web function. I used deploy to deploy it. The function just does what the main did in the example repo (earlier in this thread)
How is this model performing better than LLaMA in a lot of tasks[1] even though it's trained on a fifth of the data (200 billion vs 1 trillion tokens)?

[1]https://github.com/openlm-research/open_llama#evaluation

They are likely doing some interpolation for 200B or benchmarking it in the wrong way. E.g. HellaSwag accuracy for LLaMA 7B is 0.76[1], but it is written as 0.56 in the repo. Even at 200B tokens, it is higher than 0.56 for LLaMA, looking at the charts.

[1]: https://arxiv.org/pdf/2302.13971.pdf

They ran lm-evaluation-harness on both this model and the original llama weights, which is the correct way to do it.

Many people have been struggling to reproduce the benchmark numbers included in the original llama paper.

Nobody knows :^)
Maybe it uses a higher quality dataset
Would be very interesting to see https://github.com/BlinkDL/RWKV-LM trained on the same data
Interesting. Have you done anything with RWKV?
I evaluated RWKV recently, and it's interesting for sure. It's undertrained, and has a quirky architecture, so some parts of it are different from playing with the llama ecosystem. The huge context length is super appealing, and in my tests, long prompts do seem to work and get coherent results.

Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.

I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.

Nope, not yet, the current 14B version is much worse than LLaMA 65B. But there are apparently plans to train a RWKV-65B by the end of the year, and if including the LLaMA training dataset results in something like LLaMA-65B but with infinite context then that'd be really amazing.
How is this different from what RedPajamas is doing?

Also, most people don't mind running LLaMA 7B at home so much because of enforceability, but a lot of commercial businesses would love to run a 65b parameter model if possible and can't because the license is more meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.

RedPajama is creating a dataset. This is a permissively licensed model trained on that dataset.
RedPajama is also training both foundation and instruct-tuned models

Source: https://twitter.com/togethercompute/status/16527350961501757...

I agree with this. For a lot of companies hundreds of thousands of dollars or single digit millions on fine tuning, inference, and so on is entirely feasible but using model weights with clouded legal status isn’t.
Really exciting how fast fully pre-trained new models are appearing.

Here's another repo (with the same "open-llama" name) that has been available on hugging face as well for a few weeks. (different training dataset)

https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1

Is anyone familiar with the BOINC-style grid computing scene for ML and, specifically, LLM? Is there something interesting going on, or is it infeasible? Will things like OpenLLaMA help it?
They seem to scale up, not out, so grids don't really work.

What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.

"They seem to scale up, not out, so grids don't really work."

Can someone explain what this means? I don't understand.

https://openmetal.io/docs/edu/openstack/horizontal-scaling-v...

In a typical fully connected hidden layer, each neuron needs the values of all the neurons in the previous layer, so you need all the data in one place. Obviously you can distribute the actual calculations, which is what a GPU does, but distributing them over networked CPUs will be incredibly slow and will require the whole thing to be loaded into memory on all instances.
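
To make that concrete, here is a minimal sketch in plain Python with made-up numbers (the weights, the activations and the two "workers" are purely illustrative). Each output of a fully connected layer is a weighted sum over all activations of the previous layer, so a worker holding only part of them can compute only a partial sum and has to exchange it before the next layer can start.

```python
def dense_layer(weights, inputs):
    """Fully connected layer: each output is a weighted sum over ALL inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

# Toy values, chosen only for illustration.
prev_activations = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0.5, -1.0, 0.25, 2.0],  # weights of output neuron 0
    [1.0, 1.0, 1.0, 1.0],    # weights of output neuron 1
]

# Everything in one place: one pass gives the full result.
full = dense_layer(weights, prev_activations)

# Split the previous layer across two hypothetical workers.
# Each worker can only compute a partial sum from its own half.
half = len(prev_activations) // 2
partial_a = dense_layer([row[:half] for row in weights], prev_activations[:half])
partial_b = dense_layer([row[half:] for row in weights], prev_activations[half:])

# The partial sums must be sent over the network and added together
# before the layer's output is complete.
combined = [a + b for a, b in zip(partial_a, partial_b)]
```

The combined result equals the single-machine result, but only after a round of communication, and that round is needed at every layer.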

My bet is on some kind of light-based or analog electric accelerator PCIe card to be the next best thing for this sort of inference, since it should be able to calculate multiple layers at once. FPGAs also work but only for fixed weights.

Further than that, with big models and training rounds that want to update potentially all the values, you can't even split the work by saying "report the fitness of this model against this cost function and report back in however much time your CPU needs" because shipping around the model and data is impractical.
I mean yeah, even just doing regular inference is borderline impossible on a normal machine given that we're even having this discussion. Training is just completely unfeasible.
Up=bigger machine

Out=lots of machines through network

The more you split it up outwards (across more nodes), the more communication among nodes that is required, which doesn’t lend itself well to regular Internet connections, which means it would prefer to scale upwards with more GPU/CPU/memory capacity per node.
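
A rough back-of-the-envelope sketch of why the link speed dominates. Every figure here is an illustrative assumption, not a measurement: a 7-billion-parameter model, 2 bytes per gradient value, one full gradient exchange per training step, a 100 Mbit/s home connection and a 200 Gbit/s cluster interconnect. Latency and protocol overhead are ignored.

```python
def sync_seconds(n_params, bytes_per_value, link_bits_per_second):
    """Time to ship one full copy of the gradients over a link (bandwidth only)."""
    payload_bits = n_params * bytes_per_value * 8
    return payload_bits / link_bits_per_second

# Assumed values, for illustration only.
N_PARAMS = 7_000_000_000   # 7B-parameter model
BYTES_PER_VALUE = 2        # 16-bit gradients
HOME_LINK = 100e6          # 100 Mbit/s internet connection
CLUSTER_LINK = 200e9       # 200 Gbit/s interconnect

home = sync_seconds(N_PARAMS, BYTES_PER_VALUE, HOME_LINK)
cluster = sync_seconds(N_PARAMS, BYTES_PER_VALUE, CLUSTER_LINK)

print(f"home link:    {home / 60:.1f} minutes per exchange")
print(f"interconnect: {cluster:.2f} seconds per exchange")
print(f"ratio:        {home / cluster:.0f}x")
```

With these assumed numbers a single exchange takes roughly 19 minutes over the home link versus about half a second over the interconnect, and training needs a very large number of such exchanges.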
I haven't looked into it or tried it yet, but there is https://petals.ml/
Can someone explain how to tell if a model doesn't require a GPU and can run on a CPU?

After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:

if the model somewhere has "GGML" in its name, it doesn't require a GPU.

Technically anything that's based on pytorch can run on CPU, you just need to tell it to do so. For example, in textgen add '--cpu' and you're done. It will be super slow though.

The GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in a quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, and for q5_1 for the best quality right now (well, among quantized models).
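
To illustrate the speed/accuracy trade-off of quantization, here is a simplified sketch in plain Python. It is not the actual GGML q4_0 or q5_1 layout; the block contents, the single shared scale and the rounding scheme are assumptions chosen to show the idea of storing weights as small integers plus a scale.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4 bits, 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [scale * c for c in codes]

def max_roundtrip_error(values, bits):
    """Largest absolute difference after quantizing and dequantizing."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(v - r) for v, r in zip(values, restored))

# Made-up weights standing in for one block of a model tensor.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]

scale, codes = quantize_block(weights, bits=4)
restored = dequantize_block(scale, codes)
error_4bit = max_roundtrip_error(weights, bits=4)
error_8bit = max_roundtrip_error(weights, bits=8)

print("4-bit codes:", codes)
print("max error at 4 bits:", error_4bit)
print("max error at 8 bits:", error_8bit)
```

Fewer bits mean less memory per weight but a coarser grid and a larger round-trip error, which is the accuracy cost mentioned above.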

Oh yeah, textgen supports llama.cpp, and also provides API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:

  pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
  pip install -U llama-cpp-python

Has anyone successfully used embeddings with anything other than OpenAI's APIs? I've seen lots of debates on using embeddings vs fine-tuning for things like chatbots on private data, but is there a reason why you can't use both? I.e., fine-tune LLaMA on your data, then run the same embeddings approach on top of your own fine-tuned model?
> We are currently focused on completing the training process on the entire RedPajama dataset.

So that's 1.2 trillion tokens. Nice.

Forgive my ignorance, but can a model be further trained (refined) on a specific codebase, after, say, training on all the standard docs for the language, the 3rd party libs, and so on?

I have no formal idea how this is done, but my assumption is that "something like that" should work.

Please disabuse me of any silly ideas.

Hi Jason! I have a few thoughts on this!

Refined training usually means updating the weights of what's called a foundational model with well-structured and plentiful data. It's very expensive and can disrupt the usefulness of having all the generalizations baked in from training data [1].

While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.

Small corollary: LLMs do not know ahead of time what they are generating. Secondly, they use the input from you and from themselves to drive the next message.

This sets us up for a strategy called in-context learning [1]. We take advantage of the above corollary and prime the model with context to drive the next message. In your case, a query about some specific code base with knowledge about standard docs etc.

Only there is a big problem, context sizes. Damn. 4k tokens?

We can be clever about this but there is still a lot of work and research needed. We can take all that code and standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning. Basically the state of a trained neural network given inputs.

This will allow us to group similar words and concepts closer together in what is called a vector space. We can then do the same for our query and compare it against each stored embedding, keeping the top-k or whatever most similar ones. There are many ways to find the most similar pairs, but a nice one is cosine similarity search: basically a fancy dot product of the pairs, with a higher score indicating greater similarity. This will allow us to prime our model with the most "relevant" information to deal with the context limit. We can hope that the LLM would reason about the information just right and voila.
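
A minimal sketch of that retrieval step in plain Python. The 3-dimensional vectors are hand-made stand-ins; real embeddings would come from an embedding model and have hundreds or thousands of dimensions. The snippet texts, the `top_k` helper and the prompt wording are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, documents, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# (text, embedding) pairs with made-up embeddings.
documents = [
    ("def parse_config(path): ...", [0.9, 0.1, 0.0]),
    ("README: how to install", [0.1, 0.9, 0.1]),
    ("def load_config(path): ...", [0.8, 0.2, 0.1]),
    ("LICENSE text", [0.0, 0.1, 0.9]),
]
question = "Where is the config parsed?"
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the question

best = top_k(query_vec, documents, k=2)

# Prime the model with only the most relevant snippets to stay under the context limit.
context = "\n".join(text for _, text in best)
prompt = f"Use the following snippets to answer.\n{context}\nQuestion: {question}"
print(prompt)
```

Only the two config-related snippets end up in the prompt; the README and license text are left out, which is how this approach works around the context limit.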

So yeah, basically create a fancy information retrieval system that picks the most relevant information to give your model to reason about (basically this [3]). All that while also skirting around the context limitations, and without overfitting and narrowing the training information that allows the model to reason (controversial).

1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf

2: Embeddings https://arxiv.org/pdf/2201.10005.pdf

3: https://twitter.com/marktenenholtz/status/165156810719298355...

Much appreciated, Sun fearing dude, much appreciated.
You can train the model on more training data after it has been released.
So is this free as in “do what you f’ing like with it”?
I made a YouTube video on how to run OpenLLaMA on Google Colab with Hugging Face Transformers (using a T4 GPU): https://www.youtube.com/watch?v=1NOPciKuQb8

Hope that helps!

Has anyone actually used this? I poked around and it's so poorly documented that I don't see how one can readily, short of trying to go through the code, understand how to do a minimal run.
I've used it with llama.cpp; results are not great, but not entirely terrible (I'd say somewhere between GPT-2 and GPT-3). Still, totally free and open source is great and I'm looking forward to more development from them (and others building on top like an RLHF / alpaca / chat kind of thing).
Thanks for answering! In my skim of the thread I only saw people mention trying it with llama.cpp. I tried to get his EasyML framework going but could not figure out the parameters I needed. Definitely agree it's great to see real open source models being built.
Motivation?
Happily, licensing.
why the hell will you be happy about duplicate work?
Actually, replication is very important. If no one can make new llamas, that would mean that Facebook used some secret sauce in their training. Understanding publicly how to train these 'enhanced' models that show the performance of much larger models is a very strong motive.

And getting rid of the NC clause of the original llamas too, of course.

As of right now, there's trouble replicating the eval results of the paper, for example.

Yeah but that wasn't the reason, was it? They didn't do it because they wanted to replicate work, they did it because they didn't want the Meta lawyers to be big mad at them.
Good luck convincing Meta to release their models with a proper licence.
That is why it's sadly, licensing.
Sadly, licensing.