| Wasn't there something similar a few months ago on HN and where the top comment talked about how it's not as impressive as it sounds [0]? The main issue is that this type of methodology is pulling from a pool of images, not literally reconstructing what image was seen in the brain directly. > I immediately found the results suspect, and think I have found what is actually going on. The dataset it was trained on was 2770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if one looks at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead. > The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The building generated by 'mind reading' subject 2 and 4 look strikingly similar, but not very similar to the ground truth! From manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image. > If so, at most they found that looking at similar subjects light up similar regions of the brain, putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences. > 1. https://i.imgur.com/ILCD2Mu.png > 2. https://i.imgur.com/ftMlGq8.png [0] https://news.ycombinator.com/item?id=35012981 |
So we are doing both reconstruction and retrieval.
The reconstruction achieves SOTA results. The retrieval demonstrates that the image embeddings contain fine-grained information, not just saying it's just a picture of a teddy bear and then the diffusion model just generates a random teddy bear picture.
I think the zebra example really highlights that. The image embedding generated matches the exact zebra image that was seen by the person. If the model only could say it's just a zebra picture, it wouldn't be able to do that. But the model is picking up on fine-grained info present in the fMRI signal.
The blog post has more information and the paper itself has even more information so please check it out! :)