
	by cs702 1121 days ago
	The authors conduct automated, more methodical evaluations of LLMs finetuned to imitate ChatGPT outputs, and find that, despite superficial/informal appearances to the contrary, the base LLMs close little to none of the gap to ChatGPT on tasks that are not heavily supported in the imitation data. It's not good news for the open LLM ecosystem.

4 comments

evrydayhustling 1121 days ago

This is a very weird type of paper. They take a specific approach, then make arguments about a broad class of approaches that are under constant development. The finding that distilled LLMs must be more specialized than the giant LLMs that train them is unsurprising; nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model.

lebek 1121 days ago

> nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model

I think a lot of people believe exactly that. To take one example from the "We Have No Moat" essay:

"It doesn’t take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage. Indeed, in terms of engineer-hours, the pace of improvement from these models vastly outstrips what we can do with our largest variants, and the best are already largely indistinguishable from ChatGPT." - https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

evrydayhustling 1121 days ago

That essay works in a context of specific datasets and tasks, which are referenced in the surrounding sentences and paragraphs. They are saying that for a particular "emergent" capability you might reach with a giant LLM, you might get there more efficiently with distillation / LoRA.

My comment is about generality, which is the remaining advantage of giant models.

int_19h 1121 days ago

That is exactly what people are expecting, largely because of misleading metrics thrown around to claim ridiculous things like Vicuna-13b being nearly as good as GPT-3.5. It even shows up in the comments here, and if you go to any tangentially related subreddit, that's the kind of stuff that gets told as "everybody knows" to people setting up a local LLM for the first time.

evrydayhustling 1121 days ago

The Vicuna headline was def an overreach, although the main text admits pretty readily that their performance test (asking ChatGPT to evaluate quality) is not rigorous [1]. I'm sure that has set a lot of people pontificating about AGI with tiny models, but I can't imagine anyone who has worked directly with fine tuning having that impression.

The comments I see here are not about that. They are about small models succeeding at specific tasks, which is affirmed by this paper. Most applications of LLMs are not general purpose chat bots, so this is not bad news for most of the distill/fine tune community.

[1] https://lmsys.org/blog/2023-03-30-vicuna/

skybrian 1121 days ago

Even if they don't start out expecting that, people might be fooled by how it behaves when they try it out. So it seems useful to point out that initial impressions based on crowdsourced evaluations are misleading:

> We were initially surprised by how much imitation models improve over their base models: they are far better at following instructions, and their outputs appear similar to ChatGPT’s. This was further supported by both human and GPT-4 evaluations, where the outputs of our best imitation model were rated as competitive with ChatGPT (e.g., Figure 1, left).

("Competitive" meaning that 70% of outputs seemed about as good.)

mdale 1121 days ago

I don't know if it's bad news per se. It helps to know where to deploy a tool, its limitations, and where to focus to build something competitive / better.

a0zU 1121 days ago

They don't even use more methodological evaluations of fine-tuned LLMs, they use metrics that are specifically built to support a (false) contrarian conclusion in order to generate attention for their "paper."

flangola7 1121 days ago

Good news for alignment though. This gives me a tiny amount of hope.

int_19h 1121 days ago

So, LLMs aligned with the interests of our corporate overlords and that nebulous "national security" thing that somehow always translates to more surveillance and less due process?

flangola7 1121 days ago

This tech has about as much chance to continue unregulated as highly enriched uranium. There is no future-path that includes unregulated AI.

I don't like horrific government abuse of residents, and I would not mind throwing most billionaire CEOs into a pool of alligators and dissolving their corporations. I don't like Altman, I think he's a smart person with NOBUS-level reckless hubris who is softballing the magnitude of the dangers to wet. The status quo is not good and it's getting worse.

It doesn't matter. 5 people with launch-all-the-nukes buttons is better than 500 million.

clarge1120 1120 days ago

Fewer people agree with your premise, and that’s fortunate.

The “AI is dangerous” premise has no basis whatsoever. No one can prove it. No one can present a great thought experiment. Just doomsaying coupled with volume.

It’s starting to come off like a hidden agenda.

flangola7 1118 days ago

> Fewer people agree with your premise, and that’s fortunate.

Datacenter NVIDIA cards are already on the export control list for potential military use, and that was pre-ChatGPT and GPT-4:

> On August 26, 2022, the U.S. government, or USG, informed NVIDIA Corporation, or the Company, that the USG has imposed a new license requirement, effective immediately, for any future export to China (including Hong Kong) and Russia of the Company’s A100 and forthcoming H100 integrated circuits. DGX or any other systems which incorporate A100 or H100 integrated circuits and the A100X are also covered by the new license requirement. The license requirement also includes any future NVIDIA integrated circuit achieving both peak performance and chip-to-chip I/O performance equal to or greater than thresholds that are roughly equivalent to the A100, as well as any system that includes those circuits. A license is required to export technology to support or develop covered products. The USG indicated that the new license requirement will address the risk that the covered products may be used in, or diverted to, a ‘military end use’ or ‘military end user’ in China and Russia. The Company does not sell products to customers in Russia.

https://www.sec.gov/Archives/edgar/data/1045810/000104581022...

> The “AI is dangerous” premise has no basis whatsoever. No one can prove it. No one can present a great thought experiment. Just doomsaying coupled with volume.

If you increase the number of persuasive Goebbels and hackers attacking infrastructure by 100,000x you do not come away with a better world.

> It’s starting to come off like a hidden agenda.

AI was used to fake the moon landing and hide bigfoot /s

tiberious726 1118 days ago

I've heard many people advance this thesis (usually exactly the people who would benefit from such regulation), but I've heard no cogent arguments for it. Why do you think modern large-scale statistics needs to be regulated?
