by cs702 1121 days ago
The authors conduct automated, more methodical evaluations of LLMs finetuned to imitate ChatGPT outputs, and find that, despite superficial/informal appearances to the contrary, the base LLMs close little to none of the gap to ChatGPT on tasks that are not heavily supported in the imitation data.

It's not good news for the open LLM ecosystem.


This is a very weird type of paper. They take a specific approach, then make arguments about a broad class of approaches that are under constant development. The finding that distilled LLMs must be more specialized than the giant LLMs that train them is unsurprising; nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model.
> nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model

I think a lot of people believe exactly that. To take one example from the "We Have No Moat" essay:

"It doesn’t take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage. Indeed, in terms of engineer-hours, the pace of improvement from these models vastly outstrips what we can do with our largest variants, and the best are already largely indistinguishable from ChatGPT." - https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

That essay works in a context of specific datasets and tasks, which are referenced in the surrounding sentences and paragraphs. They are saying that for a particular "emergent" capability you might reach with a giant LLM, you might get there more efficiently with distillation / LoRA.

My comment is about generality, which is the remaining advantage of giant models.

That is exactly what people are expecting, and largely because of misleading metrics thrown around to claim ridiculous things such as Vicuna-13b being nearly as good as GPT-3.5. It even shows up in the comments here, and if you go to any tangentially related subreddit, that's the kind of stuff that gets told as "everybody knows" to people setting up a local LLM for the first time.
The Vicuna headline was def an overreach, although the main text admits pretty readily that their performance test (asking ChatGPT to evaluate quality) is not rigorous [1]. I'm sure that has set a lot of people pontificating about AGI with tiny models, but I can't imagine anyone who has worked directly with fine tuning having that impression.

The comments I see here are not about that. They are about small models succeeding at specific tasks, which is affirmed by this paper. Most applications of LLMs are not general purpose chat bots, so this is not bad news for most of the distill/fine tune community.

[1] https://lmsys.org/blog/2023-03-30-vicuna/

Even if they don't start out expecting that, people might be fooled by how it behaves when they try it out. So it seems useful to point out that initial impressions based on crowdsourced evaluations are misleading:

> We were initially surprised by how much imitation models improve over their base models: they are far better at following instructions, and their outputs appear similar to ChatGPT’s. This was further supported by both human and GPT-4 evaluations, where the outputs of our best imitation model were rated as competitive with ChatGPT (e.g., Figure 1, left).

("Competitive" meaning that 70% of outputs seemed about as good.)

I don't know if it's bad news per se. It helps to know where to deploy a tool, its limitations, and where to focus to build something competitive / better.
They don't even use more methodological evaluations of fine-tuned LLMs, they use metrics that are specifically built to support a (false) contrarian conclusion in order to generate attention for their "paper."
Good news for alignment though. This gives me a tiny amount of hope.
So, LLMs aligned with the interests of our corporate overlords and that nebulous "national security" thing that somehow always translates to more surveillance and less due process?
This tech has about as much chance to continue unregulated as highly enriched uranium. There is no future-path that includes unregulated AI.

I don't like horrific government abuse of residents, and I would not mind throwing most billionaire CEOs into a pool of alligators and dissolving their corporations. I don't like Altman, I think he's a smart person with NOBUS-level reckless hubris who is softballing the magnitude of the dangers to wet. The status quo is not good and it's getting worse.

It doesn't matter. 5 people with launch-all-the-nukes buttons is better than 500 million.

Fewer people agree with your premise, and that’s fortunate.

The “AI is dangerous” premise has no basis whatsoever. No one can prove it. No one can present a great thought experiment. Just doomsaying coupled with volume.

It’s starting to come off like a hidden agenda.

> Fewer people agree with your premise, and that’s fortunate.

Datacenter NVIDIA cards are already on the export control list for potential military use, and that was pre ChatGPT and GPT-4:

>On August 26, 2022, the U.S. government, or USG, informed NVIDIA Corporation, or the Company, that the USG has imposed a new license requirement, effective immediately, for any future export to China (including Hong Kong) and Russia of the Company’s A100 and forthcoming H100 integrated circuits. DGX or any other systems which incorporate A100 or H100 integrated circuits and the A100X are also covered by the new license requirement. The license requirement also includes any future NVIDIA integrated circuit achieving both peak performance and chip-to-chip I/O performance equal to or greater than thresholds that are roughly equivalent to the A100, as well as any system that includes those circuits. A license is required to export technology to support or develop covered products. The USG indicated that the new license requirement will address the risk that the covered products may be used in, or diverted to, a ‘military end use’ or ‘military end user’ in China and Russia. The Company does not sell products to customers in Russia.

https://www.sec.gov/Archives/edgar/data/1045810/000104581022...

> The “AI is dangerous” premise has no basis whatsoever. No one can prove it. No one can present a great thought experiment. Just doomsaying coupled with volume.

If you increase the number of persuasive Goebbels and hackers attacking infrastructure by 100,000x you do not come away with a better world.

> It’s starting to come off like a hidden agenda.

AI was used to fake the moon landing and hide bigfoot /s

I've heard this thesis advanced by many people (usually exactly the people who would benefit from such regulation), but no cogent arguments for it. Why do you think modern large-scale statistics needs to be regulated?