by malcolm_rynlds 1519 days ago
Hi, one of the authors here.

The thing to bear in mind when reading the dialogue examples in figure 11 is the custom prompt shown in Appendix D:

```
This is a conversation between a human, User, and an intelligent visual AI, Flamingo. User sends images, and Flamingo describes them.
User: <a cat image>
Flamingo: That is a cat. It’s a tiny kitten with really cute big ears.
User: <a dinner image>
Flamingo: This is a picture of a group of people having dinner. They are having a great time!
User: Can you guess what are they celebrating?
Flamingo: They might be celebrating the end of a successful project or maybe a birthday?
User: <a graph image>
Flamingo: This is a graph, it looks like a cumulative density function graph.
```

My personal opinion is that once you're doing next-token prediction with this description of what Flamingo "is" in the history, "I am not affected by this difference" is a pretty reasonable completion rather than a lucky guess. It was definitely exciting for the team that this whole example worked so nicely, but if you set aside the visual side, this "illusion of an unbelievable capacity" has been seen in other work as well.
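
To make the "description in the history" point concrete, here is a rough Python sketch of how a dialogue prefix like this might be assembled before the model predicts the next tokens. Everything in it (`PREAMBLE`, `EXAMPLE_TURNS`, `build_prompt`) is a hypothetical name for illustration, not code from the paper, and images are stood in for by text placeholders, whereas the real model receives actual image inputs.

```python
# Illustrative sketch only: these names are assumptions, not the Flamingo codebase.

# The framing sentence that tells the model what "Flamingo" is.
PREAMBLE = (
    "This is a conversation between a human, User, and an intelligent "
    "visual AI, Flamingo. User sends images, and Flamingo describes them."
)

# A subset of the example turns from the Appendix D prompt.
# Images are represented as text placeholders in this sketch.
EXAMPLE_TURNS = [
    ("User", "<a cat image>"),
    ("Flamingo", "That is a cat. It's a tiny kitten with really cute big ears."),
    ("User", "<a graph image>"),
    ("Flamingo", "This is a graph, it looks like a cumulative density function graph."),
]


def build_prompt(history, user_message):
    """Return the text prefix the model would be asked to continue.

    history: list of (speaker, text) pairs from the live dialogue so far.
    user_message: the newest User turn.
    """
    lines = [PREAMBLE]
    for speaker, text in EXAMPLE_TURNS + list(history):
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_message}")
    # The prefix ends on the speaker tag, so the most likely continuation
    # is whatever "an intelligent visual AI" would plausibly say next.
    lines.append("Flamingo:")
    return "\n".join(lines)
```

Whatever the model generates after the final `Flamingo:` tag is conditioned on the preamble, which is why an answer written from the point of view of an AI is an unsurprising completion.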

1 comments

Yeah, I didn't realize there was that additional prompt. It certainly makes more sense that if the prompt includes the description that the agent is playing the role of an AI, it would be able to deduce that it would not be affected. I was assuming there was no such indication, and so the system would be implicitly trying to predict what a human would say in that situation (since the training data is largely human-written text).

Still, you could imagine a parallel version of Flamingo which performs the same reasoning, but is artificially slowed when shown Stroop images. Obviously, there would be no way for it to deduce this fact from the training data, so it would not be able to say that, like humans, it is affected too.

I don't mean to say that this is some great failing of the system or anything - just that a casual reader might infer from the Stroop dialogue that the system had some way to inspect and reason about its own performance, when in fact it's just estimating what it thinks would be true for AI systems (since it was told that it's an AI system in the prompt) in general based on the training corpus.