Hacker News
by briga 1052 days ago
One thing to note when making comparisons like this is that LLM output is not deterministic, in the sense that if you ask it the same question 10 times you will get 10 different answers. So the question to ask is not, “is GPT4 better on this one specific question?”, but rather “does GPT4 produce better results on average?”. I would bet that it does, for no other reason than that it is much larger, and LLM performance seems to just scale with size. Also worth noting is that the more detailed your prompt is, the better the response will be. Sometimes you have to encourage GPT to get the best results. GPT4 should be able to handle much more complex and detailed prompts than 3.5.
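To make the "on average" point concrete, here is a rough simulation, not a real benchmark. The scoring function is a hypothetical stand-in for "ask the model the same question and grade the answer from 0 to 1", and the quality numbers (0.60 and 0.70) are invented for illustration, not measurements of any GPT model.

```python
import random
import statistics

def sample_scores(mean_quality, n_samples, rng):
    """Hypothetical stand-in for asking a model the same question
    n_samples times and grading each answer on a 0..1 scale."""
    return [min(1.0, max(0.0, rng.gauss(mean_quality, 0.2)))
            for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Assumed (made-up) average quality for a smaller and a larger model.
scores_small = sample_scores(0.60, 200, rng)
scores_large = sample_scores(0.70, 200, rng)

# Comparing one answer against one answer can go either way...
large_wins = sum(b > a for a, b in zip(scores_small, scores_large))

# ...while the gap between the averages is a much steadier signal.
mean_gap = statistics.mean(scores_large) - statistics.mean(scores_small)

print(f"larger model wins {large_wins} of 200 one-off comparisons")
print(f"average score gap: {mean_gap:.3f}")
```

The one-off comparisons disagree with each other often, which is why a single side-by-side question says little about which model is better.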
3 comments

An unintuitive consequence of this nondeterminism over millions of interactions is that different individuals will see different trends. IME the quality of response is accurately modeled by "luck", and people's luck can change.

So we have different populations of GPT users.

An average experience might be to get a mixture of spot-on helpful responses and obvious bullshit^H^H^Hallucinations. This population might learn what questions to ask given the limitations of the model. This is really a best case scenario, as people can actually get a feel for how to use the technology, its strengths and weaknesses, etc.

Personally, the first few dozen times I used it I was amazed at the responses. I was on team superintelligence: anyone getting lackluster responses was just holding it wrong. But luck changes, and after months of use I see now that on average the responses are just OK. This is the case that leads to disappointment and bitter conspiracy (the superintelligence is being suppressed, give it back!).

Another population had rotten luck to begin with, and got dumb, unhelpful responses over and over. This population quickly determined that the AI was all hype and stopped exploring (you don't keep going back to the casino if you lose everything your first time...).

This divergence is destructive to the larger discourse, since we have fanboys flummoxed by naysayers and critics bamboozled by hype beasts.

Interestingly enough, I don't think this applies to the APIs as much.

What I've seen on indie hacker type websites is that developers are fully on this train and not very critical of the outputs.

This is why you get very basic prompts sent by "wrapper apps", which might have given the developer a good result the one time they were tested before being put in production.

I think it might take a while before tools show up that can generate 100 test cases and test a given prompt with all 100 to report on the results... It seems to be a tough problem to crack.
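For what it's worth, the mechanical part of such a tool is small; the hard parts are generating good test cases and grading free-form answers. A minimal sketch, assuming a hypothetical `call_model` function that takes a prompt string and returns the model's text (here replaced by a fake that fails on some inputs). `PromptCase` and `evaluate_prompt` are made-up names for illustration, not an existing library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PromptCase:
    input_text: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

def evaluate_prompt(prompt_template: str,
                    cases: List[PromptCase],
                    call_model: Callable[[str], str],
                    runs_per_case: int = 3) -> Tuple[float, List[Tuple[str, float]]]:
    """Run every case several times (outputs may vary between runs)
    and return the overall pass rate plus a per-case pass rate."""
    per_case = []
    for case in cases:
        prompt = prompt_template.format(input=case.input_text)
        passes = sum(case.check(call_model(prompt)) for _ in range(runs_per_case))
        per_case.append((case.input_text, passes / runs_per_case))
    overall = sum(rate for _, rate in per_case) / len(per_case)
    return overall, per_case

def fake_model(prompt: str) -> str:
    """Stand-in for a real API call: uppercases the last line of the
    prompt, but 'fails' (echoes it unchanged) when it ends in '7'."""
    text = prompt.splitlines()[-1]
    return text if text.endswith("7") else text.upper()

# 100 generated cases, each with its own expected output.
cases = [PromptCase(f"word{i}", lambda out, i=i: out == f"WORD{i}")
         for i in range(100)]

overall, per_case = evaluate_prompt("Uppercase the following text:\n{input}",
                                    cases, fake_model)
failures = [name for name, rate in per_case if rate < 1.0]
print(f"overall pass rate: {overall:.2f}, failing cases: {failures}")
```

With a real model you would swap `fake_model` for an actual API call and probably want fuzzier checks than exact string equality.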

IMHO front-end chat end-users have many, many more "at-bats" and get to see more model results than devs do, which makes them more critical of those results.

GPT-4 has a capped number of responses, and it also costs $20/month. If it's only marginally better than GPT-3.5, why would I pay for it?
That is entirely up to your discretion. If 3.5 fits your use case, just use that.
I’ve seen this stated a ton and it’s not really true. Once trained, the model (except for decoding) is deterministic, and you can enforce determinism fairly easily. ChatGPT is not deterministic at the chat window but that’s not inherent to the model.
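A toy illustration of the decoding point, with a fixed list of made-up logits standing in for the trained model's output at one step. This is only a sketch of the idea (greedy decoding versus seeded sampling); it does not model how ChatGPT or any real serving stack actually decodes.

```python
import math
import random

def softmax(logits, temperature):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logits):
    """Always pick the highest-scoring token: no randomness involved."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_decode(logits, temperature, rng):
    """Pick a token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits: the "model" part is fixed, so the same input gives the same scores.
logits = [2.0, 1.5, 0.3, -1.0]

def sample_sequence(seed, n=10):
    """Draw n tokens using a random generator with a fixed seed."""
    rng = random.Random(seed)
    return [sample_decode(logits, 1.0, rng) for _ in range(n)]

greedy_runs = {greedy_decode(logits) for _ in range(10)}
print("greedy picks:", greedy_runs)                     # one token, every time
print("seed 42, run 1:", sample_sequence(42))
print("seed 42, run 2:", sample_sequence(42))           # identical to run 1
print("seed 7:        ", sample_sequence(7))            # may differ
```

The variation comes entirely from the sampling step; fix the seed or decode greedily and the toy setup repeats itself exactly.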