Hacker News
by moron4hire 1199 days ago
It's not an object classifier at all. They had to text-prompt the system, first. I think the general idea is using the fMRI data as the pseudorandom initialization for the latent diffusion model to explore.

From what I understand, regular Stable Diffusion starts by generating random noise and then repeatedly hallucinating modifications to that noise to make it less noisy. The longer you let it run, the better the results, at least up to a point.

So instead of starting from meaningless random noise, they start from the fMRI data. But without the text prompt, you wouldn't get the right image. If you were looking at a cat but told it you were looking at a house, you'd probably end up with a small house, similar to one in its training set, positioned roughly where the cat was in the original image.
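
To make that concrete, here's a toy sketch of what I mean. This is not Stable Diffusion and not what the paper does: `toy_denoise`, `prompt_target`, and `fmri_start` are all names and values I made up, and the "denoiser" just closes a fixed fraction of the gap to the prompt on each step. It only illustrates that the prompt decides where you end up while the starting latent decides what is left over.

```python
import numpy as np

def toy_denoise(latent, prompt_target, steps=10, strength=0.2):
    """Toy stand-in for a diffusion sampler (an assumption, not the real thing):
    each step closes a fixed fraction of the gap between the current latent
    and what the 'prompt' asks for."""
    x = latent.copy()
    for _ in range(steps):
        x = x + strength * (prompt_target - x)
    return x

rng = np.random.default_rng(0)
prompt_target = np.ones(4)                    # hypothetical "house" prompt embedding
random_start = rng.standard_normal(4)         # ordinary start: meaningless noise
fmri_start = np.array([2.0, 0.0, 0.0, 0.0])   # hypothetical fMRI-derived start

out_random = toy_denoise(random_start, prompt_target)
out_fmri = toy_denoise(fmri_start, prompt_target)

# Both outputs are pulled toward the prompt, but what remains of each one
# is a scaled-down copy of wherever it started.
residual_fmri = out_fmri - prompt_target
```

In this toy, the leftover after n steps is exactly (1 - strength)^n times the initial gap, so the prompt dominates the content and the start only shapes the residual. That's the "house where the cat was" intuition, nothing more.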

1 comment

From briefly reading the paper, it seems they trained two models (using data from different stages of the visual cortex) to generate latent vectors for both the visual and textual representations of the fMRI data, then fed those into Stable Diffusion. Those are the models that would be overfit in this case: instead of being able to encode features like "toy, animal, fluffy, brown, ears, nose, arms, legs" individually, they are likely just encoding all of those features combined into a generic "teddy bear", because the input dataset is too small. Obviously this is an oversimplification, but hopefully you get what I mean. I didn't mean it was literally an object classifier, only that a model like this, trained on a dataset so small, does not have the ability to extrapolate fine details. With a larger dataset and more training, it may actually be able to do that.
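
Roughly the shape of what I'm describing, as a sketch. I'm assuming plain ridge regression for the two models (I haven't checked exactly what the paper uses), and everything here is synthetic: the array sizes, `fit_ridge`, and the variable names are mine, not theirs.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: solve (X^T X + alpha * I) W = X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(0)
n_scans, n_voxels = 20, 50   # deliberately tiny: far fewer scans than voxels

# Synthetic stand-ins for fMRI from two stages of the visual cortex.
early_visual = rng.standard_normal((n_scans, n_voxels))
higher_visual = rng.standard_normal((n_scans, n_voxels))

# Synthetic stand-ins for the latents the diffusion model would consume.
image_latents = rng.standard_normal((n_scans, 8))
text_latents = rng.standard_normal((n_scans, 6))

# One decoder per target: fMRI -> image latent, fMRI -> text latent.
W_image = fit_ridge(early_visual, image_latents)
W_text = fit_ridge(higher_visual, text_latents)

# Decoding a new scan yields the two vectors that would be fed onward.
new_early = rng.standard_normal((1, n_voxels))
new_higher = rng.standard_normal((1, n_voxels))
z = new_early @ W_image    # stands in for the initial image latent
c = new_higher @ W_text    # stands in for the text conditioning
```

With far fewer scans than voxels, a decoder like this has little to go on beyond memorizing coarse patterns, which is the overfitting worry above.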