by mk_stjames 784 days ago
I went through all the comments here and I'm still not seeing anyone address this:

If I am reading this person correctly... they prompted the model with the prompt directly 1000 times... but only for the first time. They did not allow the model to actually run a context for chat. Simply, output the first in a list of 'left' and 'right' and favor 'left' 80% of the time... but then the author only asked for the first output.

This person doesn't understand how LLMs and their output sampling works. Or they do and they still just decided to go with their method here because of course it works this way.

The model takes the prompt. The first following output token it chooses, for this specific model, happens to be 'Left'. They shut down the prompt and prompt again. Of course the next output will be 'Left'. They aren't letting it run in context and continue. The temperature of the model is low enough that the sampler is going to always pick the 'Left' token, or at least 999/1000 in this case. It cannot start to do an 80/20 split of Left/Right if you never give it a chance to start counting in context. Continuously stopping the prompt and re-prompting will, of course, just run the same thing again.
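To make the sampling argument concrete, here is a minimal sketch of temperature-scaled sampling of a single first token. The logits are made up for illustration (they are not measured from any real model), and `sample_first_token` and `left_share` are hypothetical helper names. Each call stands in for one fresh prompt with no shared context.

```python
import math
import random

def sample_first_token(logits, temperature, rng):
    """Sample one token from temperature-scaled softmax probabilities."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Assumed logits: the model prefers "Left" but does not rule out "Right".
logits = {"Left": 2.0, "Right": 0.0}

def left_share(temperature, trials=1000, seed=0):
    """Fraction of independent first-token draws that come out as "Left"."""
    rng = random.Random(seed)
    draws = [sample_first_token(logits, temperature, rng) for _ in range(trials)]
    return draws.count("Left") / trials

print(left_share(temperature=1.0))  # near e^2 / (e^2 + 1), about 0.88
print(left_share(temperature=0.1))  # "Left" on essentially every draw
```

With these assumed numbers, lowering the temperature pushes almost all of the probability mass onto the top token, so repeated fresh prompts keep returning the same answer.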

I can't tell if the author understands this and is pontificating on purpose or if the author doesn't understand this and is trying to make some profound statement on what LLMs can't do when... anyone who knows how the model inference runs could have told you this.


I think this is overthinking it. ChatGPT is billed as a general-purpose question-answerer, as are its competitors. A regular user shouldn't have to care how it works, or know anything about context or temperature or whatever. They ask a question, it answers and appears to have given a plausible answer, but doesn't actually do the task that was asked for and that it appears to do. What the technical reasons are that it can't do the thing are interesting, but not the point.
But it is like asking a person to generate the same thing, but when they start to list off their answers, stopping them by throwing up your hand after their first response, writing that down, and then going back in time and asking that person to do the same thing, and stopping them again, and repeating, and then being surprised after 1000 times that you didn't get a reasonable distribution.

Meanwhile if you let the person actually continue, you may actually get 'Left..... Left Right Left Left... etc'.

If you asked me to pick a random number between one and six and ignore all previous attempts, I would roll a die and you would get a uniform distribution (or at least not 99% the same number).

If you are saying that this thing can't generate random numbers on the first try then it can't generate random numbers. Which makes sense. Computers have a really hard time with random, and that's why every computer science course makes it clear that we're doing pseudo random most of the time.

>If you asked me to pick a random number between one and six and ignore all previous attempts, I would roll a die and you would get a uniform distribution

I believe what GP is getting at is that if you didn't have a die, and you truly ignored all your previous attempts to the point of genuinely forgetting that the question had been asked, then your answer would likely be the same every time. Imagine asking a person with severe Alzheimer's to pick a number, then asking again a few minutes later. You'd probably get the same answer.

You’re forgetting the background neutrino flux that we tap into for randomness.
Oh, is that a hidden feature of the flux capacitor?
Yeah, I get what they are saying and there's no reason to believe that.

The die is an analogy for our decision making. They are implicitly claiming that randomness must come in an order and that simply isn't how randomness works or even halfway decent pseudo randomness.

Any system whether it be a die, a person, or an llm doesn't have to know about its previous random choices to make random choices that follow some distribution going forward presuming it's actually capable of randomness.

I'm just impressed that it answered "left" rather than outputting python code that I could run which would sample the list ["left", "right"] with an 80% bias.
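For reference, the kind of Python being described could look something like the sketch below. It uses the standard library's `random.choices`; `pick_side` is a hypothetical name for illustration, not anything the model actually produced.

```python
import random

def pick_side(p_left=0.8, rng=random):
    """Return "left" with probability p_left, otherwise "right"."""
    return rng.choices(["left", "right"], weights=[p_left, 1 - p_left], k=1)[0]

# Each call is independent and keeps no memory of earlier picks.
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[pick_side()] += 1
print(counts)
```

Run it and the tally should land near 800/200, even though no single call knows what the previous calls returned.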
It's ChatGPT, not ThinkGPT
I think __you__ are misunderstanding the experiment and what it is testing. Yours (with context) would be a different experiment and it would be interesting. But that lets the LLM count and I'm willing to bet if it does get it correct that there are unlikely to be long sequences of the same number like you'd see in a real 1k coin flip.

The author is testing for bias. This is no different to the test of "pick a number" and finding the bias of "42".

It is also testing for reasoning and logic. This is a blind consensus building exercise. Surprisingly humans do decently well on this. But it does require reasoning and collaborative forecasting. No matter the strategy the person picks, there is more going on at play than just a random selection, even if they think it is. And looking at the results, it does not seem like asking different LLMs to perform this would get the right result, as they are all biased in the same way.

What is the context window of a human?
Humans don't work like that
Thanks for the comment.

My point - which you actually make for me at the end of your comment - is that this stuff is probably intuitive to an NLP practitioner but not to a lay-person and therefore there's a kind of education/awareness piece which is what I'm trying to do here. There are no profound statements being made and I like to think I know my way around this stuff pretty well.

Based on the feedback I've added an update with a couple of new experiments where I play with multi-turn contexts. With true RNG this shouldn't make a difference but with LLMs and the way that they use context, I figured you're probably right - it's worth trying.

Looks like (and again probably no big surprise for those of us who are familiar with these systems) multi-turn probabilistic behaviour is still not in line with what was asked for within the prompt.

I’d bet the author does understand.

I’m fairly certain subsequent runs even with the same prompt will make selections with different entropy. Even with low temperature, there’s no reason a good enough model couldn’t start such a list with “Right” 20% of the time.

Isn't this way of prompting roughly equal to asking 1000 people to pick left or right with 80% prob of left? I imagine the result with humans will be closer to 80:20 than whatever happened with the LLM.
I agree it’s equivalent, and that’s a great way to think about it.

But… I wouldn’t be surprised if humans answered closer to the LLM results than 80:20. I’d actually be surprised if humans converged very close to the right result.

Would be a fun mechanical Turk experiment to run.

> Would be a fun mechanical Turk experiment to run.

Sounds like it would be a good way to determine likelihood of MTurk users being LLMs.

Kinda, if those people were clones, with the exact same starting internal state, and no memory of this ask or ability to communicate with each other.

Seed a PRNG and random() will return the same result no matter how many times you execute the test case.
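A quick sketch of the seeded-PRNG point, using Python's standard `random` module (the seed value 1234 and the `first_draw` helper are arbitrary choices for illustration):

```python
import random

def first_draw(seed):
    """Create a freshly seeded generator and return only its first value."""
    rng = random.Random(seed)
    return rng.random()

# Restarting from the same seed 1000 times gives one distinct value.
results = {first_draw(1234) for _ in range(1000)}
print(len(results))  # 1

# Letting a single generator keep running is what produces variety.
rng = random.Random(1234)
values = [rng.random() for _ in range(5)]
print(values)
```

Resetting to the same starting state and taking only the first output is the analogue of re-prompting from scratch each time; continuing the same generator is the analogue of letting the model run on in context.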

> The model takes the prompt. The first following output token it chooses, for this specific model, happens to be 'Left'. They shut down the prompt and prompt again. Of course the next output will be 'Left'.

I think what you say is correct, but isn’t this exactly the point the author is trying to make? You explain why it happens, but ultimately the result is still that the LLMs can’t do probability. At least not in the way the author presents it.

What I’m more interested in is what sort of consequences these debates and different understandings/views of LLMs will have on us in general. We monitor the usage of co-pilot in our organisation and it’s been interesting to watch how a lot of employees have started prompting the same task to multiple “agents” simultaneously. One person tends to always run at least five at once. So naturally we dug a little into why people were doing it, and they are doing it because “it’s the fastest way to get something useful”. Which is probably not too surprising to a lot of you, but it was sort of hilarious to follow along and see how two prompted “agents” gave completely conflicting answers.

I know LLMs will probably get better at being lucky, but some of these tasks… honestly, we had a couple of our juniors who were very fond of always going to GPT or co-pilot first simply look at the official documentation instead, and we timed them, and it was so much faster for them to not use LLMs. This isn’t me saying LLMs suck, we deliberately did it to make sure the juniors in question had an “oh” moment. But it’s interesting to see how quickly our employees have adopted LLMs.

Yeah increasingly I only use LLMs for very simple things, and still I think at least 25% of the time I end up reading the doc I was trying to avoid anyway.