| HN Mirror


by saurabh20n 1212 days ago

Quick notes from first glance at paper https://research.facebook.com/publications/llama-open-and-ef...:

* All variants were trained on 1T - 1.4T tokens, which is good relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]

* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]

* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.

* Math and code tasks: on math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they fail hands down against Minerva 540B) [Table 7]. On code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8]

* Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7)
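
The cost range in the second note is just GPU hours multiplied by an hourly rate. A minimal sketch of that arithmetic, assuming the $1 to $4 per GPU hour range guessed above (the rates are an assumption, not a figure from the paper; the function name is made up for illustration):

```python
# GPU-hour figures quoted in the notes above (paper's Table 15).
GPU_HOURS = {"7B": 82_432, "65B": 1_022_362}

# Assumed hourly rates in USD; the $1 to $4 range is the commenter's guess.
RATE_LOW, RATE_HIGH = 1.0, 4.0


def cost_range(gpu_hours, low=RATE_LOW, high=RATE_HIGH):
    """Return (min_cost, max_cost) in USD for a given number of GPU hours."""
    return gpu_hours * low, gpu_hours * high


for name, hours in GPU_HOURS.items():
    lo, hi = cost_range(hours)
    print(f"{name}: ${lo:,.0f} to ${hi:,.0f}")
```

This only covers the final training runs for those two sizes; it says nothing about failed or exploratory runs.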

8 comments

machinekob 1212 days ago

I hate when people don't include an approximation of the training done before the final hyperparameters are found, as it's the most costly part of the whole process most of the time.

Just "yes, we trained it for so long" etc., but they never speak about the tens or even hundreds of runs before they finalize the model parameters and architecture -.-

scotty79 1211 days ago

Aren't those done on a smaller version of the same model?

323 1212 days ago

> we used 2048 A100-80GB for a period of approximately 5 months

Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?

Wolfram Alpha:

- human - 17 MW/h ((2000 calories per day) over 20 years in MWh)

- GPUs - 3000 MW/h ((2048 * 400) W over 5 months in MWh)

We still have the edge.
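
For anyone who wants to check the two figures without Wolfram Alpha, here is a rough sketch of the same arithmetic in MWh. The 400 W per GPU and 2000 kcal per day are the assumptions from the comment; the month length is an average, and the function names are made up for illustration.

```python
KCAL_TO_KWH = 4184 / 3.6e6  # 1 kcal = 4184 J, 1 kWh = 3.6e6 J
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def human_mwh(kcal_per_day=2000, years=20):
    """Food energy eaten over `years`, in MWh (intake only, nothing else)."""
    kwh = kcal_per_day * KCAL_TO_KWH * DAYS_PER_YEAR * years
    return kwh / 1000


def gpu_mwh(n_gpus=2048, watts_per_gpu=400, months=5):
    """Energy drawn by the GPUs alone over `months`, in MWh."""
    hours = months * DAYS_PER_MONTH * 24
    return n_gpus * watts_per_gpu * hours / 1e6


print(f"human: {human_mwh():.1f} MWh")
print(f"GPUs:  {gpu_mwh():.0f} MWh")
```

Both numbers come out close to the ones above, so the disagreement in the replies is about what to count, not about the arithmetic.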

LOL, I'm being downvoted, I wonder why. Some don't like the question.

zhynn 1212 days ago

You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.

osigurdson 1212 days ago

Then you would need to include our history in the GPU calculation. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post BB.

melling 1212 days ago

Every human requires the same energy, 20+ years, and training.

The trained computer model can be duplicated and used, requiring much less energy.

None of this matters to me, though.

The goal is to build better models. We can worry about the efficiency later.

swyx 1211 days ago

exactly. we are speedrunning 200,000 years of intelligent life evolution here.

isoprophlex 1212 days ago

You mean MWh maybe, not MW/h? (which is what, J/s^2 in SI... "Power rate".)

323 1212 days ago

Right, I used the correct MWh in Wolfram, but for some reason wrote MW/h, I think it was written like that a long time ago on electricity bills.

Dylan16807 1212 days ago

> We still have the edge.

Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.

Tepix 1210 days ago

Don't mix MW/h with MWh.

Anyway, i remember hearing that the brain uses 60 Watt. That's 10.5MWh in 20 years.

But, we can't transfer/copy that gained knowledge limitlessly.

robbiep 1212 days ago

It’s because your human math for power output is so far off it’s hard to know where to start to point you in the right direction

323 1212 days ago

Please do tell. Or better provide your estimation. I just took raw calorie intake, no heating/transportation/lighting/computer usage/....

WASDx 1212 days ago

A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.

I did an extremely rough calculation recently that the training of GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, depending heavily on the energy mix of course.

Teever 1212 days ago

That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.

programmer_dude 1212 days ago

Your units are bad. Did you mean MWh instead of MW/h?

zozbot234 1212 days ago

https://github.com/facebookresearch/llama/blob/main/MODEL_CA... (linked in OP) has basic information about this model.

SethTro 1212 days ago

(1022362 + 82432) gpu-hours / 2048gpus / 5 months ~= 15% uptime.

That's only 0.07 nines of availability!

I remember in one of their old guidebooks a lot of struggle to keep their 64-machine (512 GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
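
A rough sketch of the uptime estimate above, using only the two GPU-hour figures from the comment (the 7B and 65B runs) and an average month length. The "nines" helper uses the usual -log10(1 - availability) definition; both function names are hypothetical.

```python
import math


def implied_utilization(gpu_hours, n_gpus, months, days_per_month=30.4375):
    """Fraction of the available GPU time that the reported GPU hours fill."""
    wall_hours = months * days_per_month * 24
    return gpu_hours / (n_gpus * wall_hours)


def nines(availability):
    """Number of nines, e.g. 0.999 -> 3."""
    return -math.log10(1 - availability)


u = implied_utilization(1_022_362 + 82_432, n_gpus=2048, months=5)
print(f"utilization: {u:.1%}, nines: {nines(u):.2f}")
```

Adding GPU hours for the other model sizes, or for discarded runs, would push the figure up.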

Tepix 1210 days ago

They may have thrown away some models that didn't turn out great.

foobiekr 1212 days ago

Poor GPU utilization even when available is the rule. Truly amazing. Staging of data is probably a huge part of it.

pavelstoev 1212 days ago

At CentML, we profiled GPU utilization on a large AI/ML research institute's cluster. It was in the 10% to 45% range, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get to 90% GPU utilization.

foobiekr 1211 days ago

90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.

mirker 1212 days ago

Is it failures or is this some backfill/budget scheduling while everyone is sleeping?

foobiekr 1211 days ago

A lot of it appears to be non-streaming approaches to data distribution resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want to hide the latency of data moves.

woeirua 1212 days ago

These cost estimates really make me question OpenAI's valuation.

Also, they kind of prove to me that most companies are totally incapable of making the investments necessary to get much out of this type of AI.

pgt 1212 days ago

Financial hurdles to competitors can make the company that has overcome them more defensible.

machinekob 1211 days ago

Sadly, big players take all in the current world, and Microsoft is pretty big :|

sandGorgon 1212 days ago

>* 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.*

what do you mean by this ? The OpenAI papers talk roughly about model performance scaling by parameters. does this show the other way ?

vishal0123 1212 days ago

Scaling law is for training till convergence. Both PALM and this model have been undertrained. See the training loss plot in the paper.

sandGorgon 1211 days ago

hey thanks for your reply.

umm...so does OpenAI. In fact this is OpenAI discovery from [1]:

>Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)

>We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)

P.S. Not trolling. genuinely trying to learn.

[1] https://arxiv.org/abs/2001.08361
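
To get a feel for how slowly data requirements grow under the relation quoted above (data scaling as compute to the power 0.27), here is a tiny sketch. It only evaluates the power law; the proportionality constant is ignored, and the function name is made up.

```python
def data_growth(compute_factor, exponent=0.27):
    """Factor by which data D grows when compute C grows by `compute_factor`."""
    return compute_factor ** exponent


for factor in (10, 100, 1000):
    print(f"{factor}x compute -> {data_growth(factor):.2f}x data")
```

Under that exponent, even 1000x more compute calls for well under 10x more data, which is the claim the later Chinchilla paper disputes.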

cubefox 1211 days ago

This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556

sandGorgon 1210 days ago

hi again - genuinely trying to learn here. The Chinchilla paper is a COMPETING thesis right ? the OpenAI thesis hasnt changed or superseded here.

vishal0123 1210 days ago

LLaMA traded a larger training compute budget for a smaller parameter budget. This is better for the inference compute budget.

The optimal number of tokens for 7B parameters is around 140B tokens[0], and Meta trained it on a trillion tokens.

[0]: https://arxiv.org/pdf/2203.15556.pdf
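
A quick sketch of where the 140B figure comes from, assuming the commonly cited rule of thumb of roughly 20 training tokens per parameter (an approximation often derived from the Chinchilla results, not an exact constant from the paper):

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb, not an exact figure


def compute_optimal_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Approximate compute-optimal token count for a model of n_params."""
    return n_params * ratio


def tokens_per_param(n_tokens, n_params):
    """How many training tokens were used per parameter."""
    return n_tokens / n_params


print(f"optimal for 7B: {compute_optimal_tokens(7e9) / 1e9:.0f}B tokens")
print(f"actual for 7B:  {tokens_per_param(1e12, 7e9):.0f} tokens per parameter")
```

So the 7B model saw roughly seven times more tokens than the rule of thumb suggests, which is the tradeoff described above.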

akomtu 1212 days ago

By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table?
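
The size estimate is parameters times bytes per parameter. A minimal sketch, assuming float32 storage as guessed above (float16 shown for comparison; the helper name is hypothetical):

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def param_bytes(n_params, dtype="float32"):
    """Raw storage needed for the weights alone, in bytes."""
    return n_params * BYTES_PER_PARAM[dtype]


n = 65_000_000_000
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {param_bytes(n, dtype) / 1e12:.2f} TB")
```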

hansvm 1212 days ago

Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.

Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models can very frequently respond to samples crafted to not be in the training set and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).

That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.

Your question feels like it has a motive though. What are you really asking?

akomtu 1210 days ago

LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.

hansvm 1210 days ago

You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
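
As a toy illustration of the character-wise table described above, here is a sketch that maps each 4-character context to counts of the next character. The names are made up, and it is an exact-match lookup with nothing fuzzy about it.

```python
from collections import Counter, defaultdict


def build_table(text, context_len=4):
    """Map every context_len-character window to counts of the next character."""
    table = defaultdict(Counter)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        table[context][text[i + context_len]] += 1
    return table


def next_char_distribution(table, context):
    """Return {char: probability} for a context, or {} if it was never seen."""
    counts = table.get(context)
    if not counts:
        return {}
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


table = build_table("abcdabcdabce")
print(next_char_distribution(table, "dabc"))
print(next_char_distribution(table, "zzzz"))
```

The empty result for an unseen context is the data sparsity problem mentioned above: most 4-character tuples never appear, and a longer context makes that worse.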

LLMs can track mid-range dependencies though. Consider the following input

> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.

Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words are not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.

Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.

None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.

akomtu 1210 days ago

The fuzzy hash table would use 8192-token-long sequences as keys, and when requested to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of tokens in an 8192×sizeof(token) dimensional space.

The procedure for constructing this table would be just getting all the 1.5 trillion subsequences, each 8192 tokens long, and inserting each one: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
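
A toy sketch of the scheme described above, with a tiny context length in place of 8192. The Hamming distance and the brute-force scan are assumptions made for illustration; the comment does not say which distance to use, and the scan sidesteps the fast-lookup problem entirely.

```python
from collections import Counter


def build_store(tokens, context_len):
    """Return (context, next_token) pairs, i.e. table[seq] = next token."""
    return [
        (tuple(tokens[i:i + context_len]), tokens[i + context_len])
        for i in range(len(tokens) - context_len)
    ]


def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def lookup(store, query, k=3):
    """Next-token distribution taken from the k nearest stored contexts."""
    nearest = sorted(store, key=lambda item: hamming(item[0], query))[:k]
    if not nearest:
        return {}
    counts = Counter(next_token for _, next_token in nearest)
    return {tok: n / len(nearest) for tok, n in counts.items()}


store = build_store([1, 2, 3, 1, 2, 4, 1, 2, 3], context_len=2)
print(lookup(store, (1, 2), k=3))
```

Every query here scans the whole store, which is fine for nine tokens and hopeless for 1.5 trillion.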

hansvm 1210 days ago

Ah, so less a hash table and more vanilla KNN?

Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
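
The exponent in the edit above can be checked in log space, since 2^8192 is far too large to handle as a float. A small sketch (the function name is made up):

```python
import math


def log2_ratio(vocab_size, seq_len, n_entries):
    """log2 of (possible sequences of seq_len tokens / number of table entries)."""
    return seq_len * math.log2(vocab_size) - math.log2(n_entries)


# Two-token vocabulary, 8192-long sequences, 1.5 trillion stored entries.
print(f"2^{log2_ratio(2, 8192, 1.5e12):.1f} times more sequences than entries")
```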

make3 1212 days ago

do they do instruction fine-tuning