
by lappa 1109 days ago

Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5!

AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.

- Llama 1 (llama-65b): 57.6

- Llama 2 (llama-2-70b-chat-hf): 64.6

- GPT-3.5: 85.2

- GPT-4: 96.3

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.

- Llama 1: 84.3

- Llama 2: 85.9

- GPT-3.5: 85.3

- GPT-4: 95.3

MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.

- Llama 1: 63.4

- Llama 2: 63.9

- GPT-3.5: 70.0

- GPT-4: 86.4

TruthfulQA (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually at minimum a 6-shot task, as 6 examples are systematically prepended to it, even when it is launched with 0 as the number of few-shot examples.

- Llama 1: 43.0

- Llama 2: 52.8

- GPT-3.5: 47.0

- GPT-4: 59.0
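For readers unfamiliar with the "N-shot" labels above: they refer to how many solved examples are placed in the prompt before the question being scored. A minimal sketch of that idea follows; the `Question:`/`Answer:` template and the function name are assumptions for illustration, not the format the leaderboard actually uses.

```python
def build_few_shot_prompt(examples, question, k):
    """Prepend k solved (question, answer) pairs to the question under test."""
    # Hypothetical template; real harnesses use task-specific formats.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


solved = [
    ("What gas do plants absorb?", "Carbon dioxide"),
    ("What force pulls objects toward Earth?", "Gravity"),
    ("What is frozen water called?", "Ice"),
]

two_shot = build_few_shot_prompt(solved, "What planet is closest to the Sun?", k=2)
zero_shot = build_few_shot_prompt(solved, "What planet is closest to the Sun?", k=0)
print(two_shot)
```

With `k=0` the prompt contains only the question, which is why a task that always prepends fixed examples is not truly 0-shot.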

[0] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


gitgud 1109 days ago

Is it possible that some LLMs are trained on these benchmarks? That would mean they’re overfitting and are incorrectly ranked. Or am I misunderstanding these benchmarks?…

FanaHOVA 1109 days ago

Presented with no comment :) https://twitter.com/chhillee/status/1635790330854526981?s=46...

lumost 1109 days ago

Having worked on ML products, there is sometimes debate on whether you should train on the test partition prior to prod deployment - after all, why would you ship a worse model to prod? Obviously you can't tell whether the model is better at generalization compared to an alternate technique, and you also incur some overfit risk. But many industrial problems are solvable through memorization.

sangnoir 1109 days ago

> after all, why would you ship a worse model to prod?

...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folk love removing the "science" from "data science"

baobabKoodaa 1109 days ago

You can evaluate a version of the model that has been trained on one set of data, and ship to production a different model that has been trained on the complete set of data. In many cases one can reasonably infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.
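A minimal sketch of the workflow described here: estimate quality with a model fit on a training split, then refit on all of the data for shipping. The synthetic data, the one-variable linear model, and every name below are assumptions chosen to keep the example self-contained.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)
# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in xs]
rng.shuffle(data)
train, test = data[:160], data[160:]

eval_model = fit_line(train)         # used only to estimate quality
holdout_mse = mse(eval_model, test)  # the number you would report
ship_model = fit_line(data)          # refit on everything before shipping
print(f"held-out MSE of the evaluated model: {holdout_mse:.3f}")
```

Note that `ship_model` itself is never scored: once the test split is folded into training, no untouched data remains, so its quality is inferred from `eval_model` rather than measured.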

mafuy 1108 days ago

This is possible if you, e.g., train 1000 models on different subsets of data and verify that each and every one of them is performing well. In that case, you can reasonably infer that another model trained on all data would work well, too.

But this is, of course, 1000 times more expensive to do. And if you only train 100, or 10, or 1 model, then the deduction becomes increasingly unstable.

So from a practical point of view, it's probably not feasible, because you would put those resources into something else instead that has more ROI.
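The argument above can be sketched as repeated random subsampling: train several models on different subsets, score each on the data it did not see, and look at the mean and the spread. The toy data, the linear model, and the run counts are assumptions for illustration only.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, points):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


def subsample_scores(data, runs, train_frac=0.8, seed=0):
    """Train `runs` models on random subsets; score each on its held-out rest."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        scores.append(mse(fit_line(shuffled[:cut]), shuffled[cut:]))
    return scores


rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in xs]

for runs in (1, 10, 100):
    scores = subsample_scores(data, runs)
    # A spread cannot be estimated from a single run at all.
    spread = f"{statistics.stdev(scores):.3f}" if runs > 1 else "n/a"
    print(f"runs={runs:3d} mean MSE={statistics.mean(scores):.3f} stdev={spread}")
```

The cost grows linearly with the number of runs, which is the trade-off the comment points at: trivial for a toy model, prohibitive when a single training run is already expensive.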

Naracion 1107 days ago

>infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

It really depends upon the data. A smaller set of data that mostly consists of "truth" might be better than a larger dataset that also has many "lies".

Perhaps what you mean is that the model might be more representative, rather than _better_.

janalsncm 1109 days ago

There are offline metrics and online metrics. Offline metrics might be something like AUROC on a test set. Once you’ve pushed the model online, you can check the online metrics. Ultimately the online metrics are more important, that’s the whole reason the model exists in the first place.

Your control in an online environment is the current baseline. You don’t need to save the test set anymore, you can push it online and test it directly.
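As a concrete instance of the offline metric mentioned above, AUROC can be computed directly from its definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. This is a small pairwise sketch for illustration; the function name and the toy labels are assumptions, and it is quadratic in the number of examples.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


test_labels = [0, 0, 1, 1]
test_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(test_labels, test_scores))  # 3 of 4 positive/negative pairs are ordered correctly
```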

snowstormsun 1109 days ago

Why would you want to ship an untested model? That's insane.

baobabKoodaa 1109 days ago

This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)

snowstormsun 1109 days ago

Yeah but in competitions there's a secret test set used to evaluate the model.

sundarurfriend 1109 days ago

Nitter link: https://nitter.net/chhillee/status/1635790330854526981/

stevefan1999 1109 days ago

Unfortunately, Goodhart's law applies to most kinds of tests

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

iambateman 1109 days ago

This is SAT-prep in a nutshell. :)

famouswaffles 1109 days ago

Test leakage is not impossible for some benchmarks. But researchers try to avoid/mitigate that as much as possible for obvious reasons.
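One common way to look for this kind of leakage is to check whether long word n-grams from benchmark items also occur in the training text. The sketch below is a simplified illustration of that idea, not any lab's actual pipeline; the whitespace tokenization, the function names, and the n-gram length are assumptions.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged


training_docs = ["the quick brown fox jumps over the lazy dog near the river"]
benchmark_items = [
    "Which animal jumps over the lazy dog in the story?",
    "What is the boiling point of water at sea level?",
]
rate, flagged = contamination_rate(benchmark_items, training_docs, n=4)
print(rate, flagged)
```

A check like this only catches verbatim overlap; paraphrased or translated copies of a test item would pass unnoticed.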

pclmulqdq 1109 days ago

Given all of the times OpenAI has trained on peoples' examples of "bad" prompts, I am sure they are fine-tuning on these benchmarks. It's the natural thing to do if you are trying to position yourself as the "most accurate" AI.

famouswaffles 1109 days ago

Assuming they were doing that, fine-tuning on benchmarks isn't the same as test leakage/testing on training data. No researcher is intentionally training on test data.

If it performs about as well in instances it has never seen before (test set) then it's not overfit to the test.

nightski 1109 days ago

I'm confused, fine-tuning is training. How is that not leakage? I'm hesitant to call them researchers, they are employees of a for-profit company trying to meet investor expectations.

famouswaffles 1109 days ago

1. You train on the kind of problems you want to solve. You don't report numbers that evaluate performance based on examples it trained on. Datasets will typically have splits, one for training and another for testing.

2. OpenAI is capped-profit. They are also not a publicly traded company. Researchers are researchers regardless of who they work for. Training on test data is especially stupid for commercial applications because customers find that out quick and any reputation is gone.

clarge1120 1109 days ago

Besides, OpenAI dropped all pretense of being open and transparent as soon as they saw how popular their open and transparent technology had become.

TX81Z 1109 days ago

“No researcher is intentionally training on test data.”

Citation Needed.

sp332 1109 days ago

Yeah, it happens. https://hitz-zentroa.github.io/lm-contamination/blog/

option 1109 days ago

that’s why OpenAI didn’t release any details on GPT4 training data blend ;)

bbor 1109 days ago

It would be a bit of a scandal, and IMO too much hassle to sneak in. These models are trained on massive amounts of text - specifically anticipating which metrics people will care about and generating synthetic data just for them seems extra.

But not an expert or OP!

stu2b50 1109 days ago

I don't think it's a scandal, it's a natural thing that happens when iterating on models. OP doesn't mean they literally train on those tests, but that as a meta-consequence of using those tests as benchmarks, you will adjust the model and hyperparameters in ways that perform better on those tests.

For a particular model you try to minimally do this by separating a test and validation set, but on a meta-meta level, it's easy to see it happening.

jasonfarnon 1109 days ago

You don't see an engineer at an extremely PR-conscious company at least checking how their model performs on popular benchmarks before rolling it out? And if its performance is lackluster, do you really see them doing nothing about it? It probably doesn't make a huge difference anyway. I know those old vision models were overfitted to the standard image library benchmarks, but they were still very impressive.

fbdab103 1109 days ago

Famously, some of the image models were so overtrained they could still yield impressive results if the colors were removed.

lumost 1109 days ago

This wasn't so much overtraining, as the models learning something different than what we expected. If you look at a pixel by pixel representation of an image, textures tend to be more significant/unique patterns than shapes. There are some funny studies from the mid 2010s exploring this.

moneywoes 1109 days ago

How would it even be possible to verify that?

mdp2021 1109 days ago

"Verify", that's quite a demand;

"corroborate", you find queries of the same level which would give satisfactory output upon good performance but fail in a faulty overfitted model.

doctoboggan 1109 days ago

Good to see these results, thanks for posting. I wonder if GPT-4's dominance is due to some secret sauce or if it's just the first mover advantage and Llama will be there soon.

Roark66 1109 days ago

In ChatGPT there is plenty of "secret sauce" in their output sampling, such as sending the output for scoring by another model.

As for GPT-4, allegedly it is a combined model (many domain-specific models), so perhaps add extra input processing by yet another model to detect the problem domain and send it to the right specialised model.

famouswaffles 1109 days ago

It's just scale. But scale that comes with more than an order of magnitude more expense than the Llama models. I don't see anyone training such a model and releasing it for free anytime soon

bbor 1109 days ago

I thought it was revealed to be fundamentally ensemblamatic in a way the others weren’t? Using “experts” I think? Seems like it would meet the bar for “secret sauce” to me

famouswaffles 1109 days ago

Sparse MoE models are neither new nor secret. The only reason you haven't seen much use of them for LLMs is because they would typically well underperform their dense counterparts.

Until this paper (https://arxiv.org/abs/2305.14705) indicated they apparently benefit far more from Instruct tuning than dense models, it was mostly a "good on paper" kind of thing.

In the paper, you can see the underperformance I'm talking about.

Flan-Moe-32b (259b total) scores 25.5% on MMLU before Instruct tuning and 65.4% after.

Flan 62b scores 55% before Instruct tuning and 59% after.

cubefox 1109 days ago

This paper came out well after GPT-4, so apparently this was indeed a secret before then.

famouswaffles 1109 days ago

The user I was replying to was talking about the now and future.

We also have no indication sparse models outperform dense counterparts so it's scale either way.

HeWhoLurksLate 1109 days ago

Is there a difference here between a secret and an unknown? It may well be that some researcher / comp engineer had an idea, tried it out, realized it was incredibly powerful, implemented it for real this time and then published findings after they were sure of it?

I'm more of a mechanical engineering adjacent professional than a programmer and only follow AI developments loosely

l33tman 1109 days ago

The quoted paper yes, but the MoE concept and layers and training is old.

Published as a conference paper at ICLR 2017

OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean
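For readers who have not seen the idea, the core of a sparsely-gated MoE layer is a gate that keeps only the top-k experts per input and mixes their outputs with softmax weights. The sketch below is a toy scalar version under simplifying assumptions: the gate logits are passed in directly rather than produced by a learned layer, and the noise term and load-balancing loss from the paper are omitted. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k):
    """Keep the k largest logits and softmax over them; other experts get no weight."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs."""
    gates = top_k_gate(gate_logits, k)
    # Only the selected experts are evaluated; this is where the sparsity comes from.
    return sum(w * experts[i](x) for i, w in gates.items())


# Toy "experts": plain functions standing in for feed-forward sub-networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(y)  # experts 1 and 3 are selected with equal weight
```

Because only k experts run per input, the total parameter count can grow much faster than the compute per token, which is the trade-off discussed in this subthread.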

fnordpiglet 1109 days ago

GPT4 is rumored to have 1.7T parameters, Llama 2 70B.

az226 1109 days ago

230x8 MoE.

Roark66 1109 days ago

I have to say in my experience falcon-40b-instruct got very close to ChatGPT (gpt-3.5), even surpassing it in a few domains. However, it is important to note that (not at all) OpenAI are doing tricks with the model output. So comparing OS models with just greedy output decoding (very simple) is not fair for OS models.

Still, I'm very excited this model at 13B seems to be matching falcon-40B in some benchmarks. I'm looking forward to using it :-)
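To make the decoding point concrete, here is a small sketch contrasting greedy decoding with temperature sampling over a single step's logits. It does not claim to show what OpenAI does, which the comment leaves unspecified; the logits and function names are made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Draw a token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)
step_logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
print("greedy:", greedy(step_logits))
print("sampled:", [sample(step_logits, 1.0, rng) for _ in range(10)])
```

Greedy decoding is deterministic, while sampling produces varied outputs that can then be filtered or reranked, so benchmark numbers obtained under different decoding setups are not directly comparable.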

fnl 1109 days ago

> OpenAI are doing tricks with the model output

Do you have any pointers to the “tricks” that are being applied?

jcuenod 1108 days ago

Sounds like a reference to Mixture of Experts

zzzzzzzza 1107 days ago

could be something like prompt rewriting or chain of thought or reflexion going on in the background as well

ineedasername 1109 days ago

When were the GPT-4 benchmarks calculated, on original release or more recently? (curious per the debate about alleged gpt-4 nerfing)

lappa 1109 days ago

They're based on the original technical report.

"Refuel" has run a different set of benchmarks on GPT-3.5 and GPT-4 and found a decline in quality.

https://www.refuel.ai/blog-posts/gpt-3-5-turbo-model-compari...

ShamelessC 1109 days ago

Plenty of the complaints/accusations predate the release of the 0613 set of models.

To be clear, I have trouble with the theory as I have not yet seen evidence of "nerfing". What you provided is actually the _only_ evidence I've seen that suggests degradation - but in this case OpenAI is being completely transparent about it and allows you to switch to the 0314 model if you would like to.

Every complaint I have seen has been highly anecdotal, lacking any rigor, and I bet are explained by prolonged usage resulting in noticing more errors. Also probably a bit of "the magic is gone now" psychological effect (like how a "cutting edge" video game such as Half-Life 2 feels a bit lackluster these days).

digitcatphd 1109 days ago

Could it be the case that many of these models are just learning the benchmark material, so that it ends up included in their parameters?

marcopicentini 1109 days ago

How do they compare the exact value returned in a response? I've found that getting a stable JSON format back is unpredictable, and sometimes it replies in a different language.
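Multiple-choice benchmarks are commonly scored without parsing free-form output at all: each candidate answer is scored by the log-likelihood the model assigns to it as a continuation of the question, and the highest-scoring choice is taken as the prediction. The sketch below illustrates that idea only; `toy_loglikelihood` and its table of numbers are hypothetical stand-ins for a real model, not the leaderboard's code.

```python
def pick_answer(context, choices, loglikelihood):
    """Return the index of the choice the scorer finds most likely after the context."""
    scores = [loglikelihood(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


def accuracy(items, loglikelihood):
    """Fraction of items where the top-scoring choice matches the gold index."""
    correct = sum(
        pick_answer(item["question"], item["choices"], loglikelihood) == item["gold"]
        for item in items
    )
    return correct / len(items)


# Hypothetical stand-in for a model: a fixed table of made-up log-probabilities.
FAKE_LOGPROBS = {"Paris": -0.2, "Rome": -2.5, "Berlin": -3.0, "4": -0.1, "5": -1.9}


def toy_loglikelihood(context, continuation):
    return FAKE_LOGPROBS.get(continuation, -10.0)


items = [
    {"question": "The capital of France is", "choices": ["Rome", "Paris", "Berlin"], "gold": 1},
    {"question": "2 + 2 =", "choices": ["5", "4"], "gold": 1},
]
score = accuracy(items, toy_loglikelihood)
print(score)
```

Scoring this way sidesteps unstable output formats, because the model never has to produce a parseable answer string.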

redox99 1109 days ago

Your Llama2 MMLU figure is wrong

sebzim4500 1109 days ago

Looks like he copied it from https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

I see different figures in different places, no idea what's right.