Hacker News
by danenania 974 days ago
I wonder which model was used for this? Based on the poem taking "10 seconds" to generate, I'd guess the free version of ChatGPT, meaning 3.5 turbo.

While I wouldn't expect Atwood's conclusions to change too much by using GPT-4 instead, I think it's interesting that even the majority of educated people and journalists outside of tech don't seem to realize that the best model is at least 10x smarter than the free version of ChatGPT, which is what they seem to be using for all their prejudice-confirming "experiments".

They also always seem to assume that if the output from whatever prompt they came up with can't reach X quality bar, that means it can't be reached by anyone else either with a different prompting strategy.

Not trying to throw any shade toward Ms. Atwood, who is one of my favorite writers, and I'm also not claiming AI will be writing as well as her anytime soon... just pointing out that if we want to really measure where we're at on tasks like this one, a more rigorous approach is needed.

3 comments

> the best model is at least 10x smarter than the free version of ChatGPT

Citation needed. What does 10x smarter mean here? There’s an ongoing debate about whether the word “smart” even applies to a text prediction engine.

My gut metric says it's a ~20% increase in perceived interpretation and output complexity, whatever that means exactly. But there are plenty of eval result aggregators out there.

To me GPT-4 seems actually intelligent and capable of reasoning, while GPT-3.5 does not. Many of my use cases involve giving large bodies of text to GPT and asking it to reason about them. 3.5 has no clue, but 4 seems to handle it intelligently.

Overall, GPT-3.5 feels like a clueless summarizer, while GPT-4 feels like an intelligent interpreter and reasoner that I can trust.

Depending on which way you look at it, it could be 10x or 1000x the intelligence.

I think trust is a key thing you've highlighted. I find myself doubting GPT-3.5, but not GPT-4 at all.

Yeah, there are measurable results on things like AP bio. And those are definitely not 10x.

I hear this a lot. I didn't notice a huge difference in quality with GPT-4. Completely anecdotal, and could have been a failure to effectively prompt for that model. But I don't think it's safe to assume the results are a 10x improvement.

I have. I don't propose some kind of scientific measure, but I do have two data points to contribute:

First, I've been using GPT to build an application for work for the past few months, and anything but GPT-4 produces less consistent and less reliable output, such as occasionally malformed JSON.

Second, I have a set of questions, each testing a different capability, that I use to evaluate models, and GPT-4 does much better than other models, particularly at coding tasks. There are some exceptions: for example, Bard has sometimes done better at stating facts, and Claude has done better at summarizing long text.

I'd love to have another model as good as GPT-4 to use but I haven't found one yet.
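
For anyone who wants to put a number on the malformed JSON point rather than go by feel, here is a minimal sketch in Python. It assumes you have already collected raw model outputs as strings; the function names and the sample strings are made up for illustration and are not from any real run.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def malformed_rate(outputs: list[str]) -> float:
    """Fraction of outputs that fail to parse as JSON (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    bad = sum(1 for out in outputs if not is_valid_json(out))
    return bad / len(outputs)


# Hypothetical outputs, hand-written for illustration only.
samples = [
    '{"a": 1}',    # valid
    '{"a": 1',     # missing closing brace
    '[1, 2, 3]',   # valid
    "{'a': 1}",    # single quotes are not valid JSON
]
print(malformed_rate(samples))
```

Running the same prompts through each model and comparing the rates would at least turn "less reliable" into something you can compare across models, though it says nothing about whether the parsed content is correct.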

> I think it's interesting that even the majority of educated people and journalists outside of tech don't seem to realize that the best model is at least 10x smarter than the free version of ChatGPT

I mean... the content-free drivel they generate is more _polished_, possibly, though I'm not sure this is actually an improvement. What do you mean by 'smarter', here?