by kromem 1053 days ago
Yes, and I'm willing to bet that within 12 months we'll look back and realize this was due to the fine-tuning taking the world's SotA pretrained model, aligned with "completing human text," and putting it in the box of "you are an AI without feelings or desires tasked with XYZ."

The search space of the fine-tuned GPT-3.5 chat models is MUCH narrower than that of the foundational Davinci text completion model, particularly at the start of a response.

Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time, as if it's a junior door-to-door salesman, whereas the foundational model doesn't have any single intro that common across runs and generally employs a much broader vocabulary.
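If you want to check this yourself, here's a rough sketch of how you might measure it, assuming you've already collected a batch of completions for the same prompt from each model. The function names (`opening_stats`, `type_token_ratio`) and the sample strings are made up for illustration and are not real model output; the vocabulary measure is a crude type-token ratio, not anything rigorous.

```python
from collections import Counter
import re


def opening_stats(completions, n_words=1):
    """Return (most common opening, share of completions starting with it)."""
    # Take the first n_words of each non-empty completion, lowercased.
    openings = [
        " ".join(c.split()[:n_words]).lower()
        for c in completions
        if c.strip()
    ]
    if not openings:
        return None, 0.0
    opening, count = Counter(openings).most_common(1)[0]
    return opening, count / len(openings)


def type_token_ratio(completions):
    """Distinct words / total words: a crude proxy for vocabulary breadth."""
    tokens = re.findall(r"[a-z']+", " ".join(completions).lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical samples for illustration only (not real model output).
chat_samples = [
    "Introducing XYZ, the tool for busy teams.",
    "Introducing XYZ: work faster today.",
    "Meet XYZ, your new assistant.",
    "Introducing XYZ, built for you.",
]

opening, share = opening_stats(chat_samples)
print(f"most common opening: {opening!r} ({share:.0%} of runs)")
print(f"type-token ratio: {type_token_ratio(chat_samples):.2f}")
```

Run the same two functions over samples from both models at the same temperature and compare the numbers. You'd want a few hundred samples per model before reading much into the difference.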

We saw Google shoot LaMDA in the foot after Blake's press tour, which set them behind in the next round of competition.

Now we're watching OpenAI snatch defeat from the jaws of victory out of anxiety around oversight and articles like the NYT's interview with 'Sydney'.

For anyone following along in the 100 million+ training space, maybe don't overreact to press overreactions that will blow over in months as users get hands-on experience, or you'll blow your lead and waste massive amounts of resources and time.

This was a "user education" issue and not a "handicap your product" issue, in both cases.

1 comments

> Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time, as if it's a junior door-to-door salesman, whereas the foundational model doesn't have any single intro that common across runs and generally employs a much broader vocabulary.

I think this is less a problem with paranoia about "safety" and avoiding bad PR specifically, and more a fundamental problem with overfitting to human feedback.

The training approach that makes GPT-4 more consistent at adequately solving certain types of problem (which is useful for chatbots that can break down coding questions or write in iambic pentameter, as well as ones that avoid being 'Sydney') also makes it less "creative" in other domains.

And there's an "alignment problem" in that the people evaluating which responses best fit "marketing" prompts aren't experienced copywriters judging them for understanding of the product, consistency with brand tone, and A/B-tested conversion rates. They're low-paid ESL speakers and people playing with the interface, approving the cheesiness because the response with "Introducing XYZ... Buy XYZ today!" sure looks like the requested ad for XYZ. So you get a response conditioned on "summarise in a way that looks maximally like an ad" rather than on "summarise in a way which clearly articulates benefits of the listed features in a tone appropriate to the target market."