Hacker News
by _sword 841 days ago
At this point I wonder how much of the GPT-4 advantage has been OpenAI's pre-training data advantage vs. fundamental advancements in theory or engineering. Has OpenAI mastered deep nuances others are missing? Or is their data set large enough that most test-cases are already a sub-set of their pre-training data?
5 comments

More than pretraining data, I think the advantage was ChatGPT and how quickly it grew. Remember it was 3.5, and within a month or two it had generated a huge number of real q&a pairs with ratings, feedback, and production-level data on how actual users use a model. Those queries, plus the subsequent RLHF and generating better answers to the questions, meant the model would have improved a lot at the SFT stage. I think this is why Anthropic, Google, and Mistral all launched their own chatbots, offering them to users for free and getting realtime q&a data to finetune their models on. Google did it with Bard too, but it was so bad that not many used it.
My understanding is that GPT-4 had been almost fully trained before ChatGPT was released. They spent around six months testing GPT-4 before making it available to the public: ChatGPT came out 30th November 2022, GPT-4 came out March 14th 2023.

But maybe that was still enough time for them to instruction tune it based on ChatGPT feedback, or at least to focus more of their fine tuning iteration in the areas they learned were strong or weak for 3.5 based on ChatGPT usage?

I don't think it was pretrained on knowledge gaps. A version was already available in testing with select customers. The version released to the public would definitely have incorporated feedback from those customers, and been finetuned/instruction tuned on the data from ChatGPT.

Training data is the publicly available internet (and accessible to everyone). It's the SFT step with high quality examples that determines how well a model is able to answer questions. ChatGPT's virality played a part in that, in the sense that OAI got real world examples + feedback others did not have. And yeah, it would have been logical to focus on 3.5's weaknesses too. From Karpathy's videos, it seems they hired a contract labelling firm to generate q&a pairs.

Also worth remembering that Bing Chat launched on February 7, already running GPT-4.
I'd guess a bit of both, perhaps more on the data side. One could also flip the question and ask how this new Anthropic model is able to beat GPT-4 on some benchmarks.

As far as data goes, OpenAI haven't just scraped/bought existing data; they have also had custom datasets created on a fairly large scale (hundreds of contractors), which is another area where they may have a head start unless others can find different ways around this (e.g. synthetic data, or filtering for data quality).

Altman has previously said (on Lex's podcast, I think) that OpenAI (paraphrasing) is all about results and has used some ad-hoc approaches to achieve that, without hinting at what those might be. But given how fast others like Anthropic and Google are catching up, I'd assume each has their own bag of tricks too, whether that comes down to data and training or architectural tweaks.

There was a period of time when data was easily accessible, and OpenAI suctioned up as much of it as possible. Places have locked the doors since then, realizing someone was raiding their pantry.

To get that dataset now would take significantly more expense.

I would have thought that Anna's Archive is still the best source of high quality tokens and that is fully open.
This may explain the substantial performance increase in proprietary models over the last 6 months. It also may explain why OpenAI and others had to drop open models. Distributing copyrighted material via model weights would be problematic.
So far GPT is the only one able to answer variations of these prompts: https://www.lesswrong.com/posts/EHbJ69JDs4suovpLw/testing-pa... It might be trained on these, but you can still create variations and get decent responses.

Most other models fail on basic stuff like the Python-creator-on-Stack-Overflow question: they identify Guido as the Python creator, so the knowledge is there, but they don't make the connection.

>>So far gpt is the only one able to answer to variations of these prompts

You're saying that when Mistral Large launched last week you tested it on (among other things) explaining jokes?

Sorry I did what? When?
You linked to a LessWrong post with prompts asking the AI to explain jokes (among other tasks?) and said only OpenAI models can do it, didn't you? I'm confused about why you said only OpenAI models can do it.
Ah, sorry if it wasn't clear: below the jokes there are a few inference prompts, and so far I haven't seen Claude or others reason the same way as PaLM or GPT-4 (GPT-3.5 did get some wrong). I haven't had time to test Mistral Large yet, though. Mixtral didn't get them right.