by atom_101
1060 days ago
This is mostly correct, except that there is only one model. This model takes an fMRI scan and predicts two outputs. The first is specialized for retrieval, and the second can be fed into a diffusion model to reconstruct images. You can see the comparison in performance between LAION-5B retrieval and actual reconstructions in the paper. When retrieving from a large enough database like LAION-5B, we can get images that are quite similar to the seen images in terms of high-level content, but not so similar in low-level details (relative position of objects, colors, texture, etc.). Reconstruction with diffusion models does much better on low-level metrics.
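To make the "one model, two outputs" point concrete, here is a toy sketch of a single network with a shared backbone and two heads. It is not the architecture from the paper: the class name `TwoHeadModel`, the layer sizes, and the plain linear layers are all assumptions chosen for illustration.

```python
import random


def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def init_weights(n_in, n_out, rng):
    """Small random weights; purely illustrative, nothing is trained here."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


class TwoHeadModel:
    """Hypothetical model: one shared backbone feeding two output heads."""

    def __init__(self, n_voxels, hidden, embed_dim, seed=0):
        rng = random.Random(seed)
        self.backbone = init_weights(n_voxels, hidden, rng)
        # Head 1: embedding meant for retrieval (nearest-neighbour lookup).
        self.retrieval_head = init_weights(hidden, embed_dim, rng)
        # Head 2: embedding meant to condition a diffusion model.
        self.recon_head = init_weights(hidden, embed_dim, rng)

    def forward(self, voxels):
        # Shared features with a ReLU nonlinearity.
        h = [max(0.0, v) for v in linear(voxels, self.backbone)]
        return linear(h, self.retrieval_head), linear(h, self.recon_head)


# Assumed toy sizes: 16 "voxels", 8 hidden units, 4-dimensional embeddings.
model = TwoHeadModel(n_voxels=16, hidden=8, embed_dim=4)
rng = random.Random(1)
scan = [rng.gauss(0, 1) for _ in range(16)]
retrieval_emb, recon_emb = model.forward(scan)
```

Both embeddings come out of one forward pass over the same scan; only the final projection differs.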
|
I agree only one is used at inference, but two are needed for training (otherwise, how do you compute a meaningful loss?). Note that the original CLIP paper trains an image encoder and a text encoder jointly, even though many downstream uses need only one of them at inference (for example, only the text encoder when conditioning a diffusion model). [0]
[0] https://arxiv.org/pdf/2103.00020.pdf
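
For anyone wondering why two encoders are needed to get a loss at all, here is a rough sketch of a CLIP-style symmetric contrastive loss. It is a simplified illustration rather than the code from either paper: the function names are made up, and the fixed `temperature=0.07` is an assumption (CLIP learns its temperature during training). The key point is that the loss is computed over pairs of embeddings, one from each encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    emb_a[i] and emb_b[i] are outputs of two different encoders for the
    same underlying sample; every other pairing in the batch is a negative.
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: encoder A and encoder B agree on which items belong together.
matched_a = [[1.0, 0.0], [0.0, 1.0]]
matched_b = [[0.9, 0.1], [0.1, 0.9]]
# Same embeddings, but paired with the wrong partner.
shuffled_b = [matched_b[1], matched_b[0]]

loss_matched = contrastive_loss(matched_a, matched_b)
loss_shuffled = contrastive_loss(matched_a, shuffled_b)
```

The correctly paired batch gives a much lower loss than the shuffled one, which is the training signal. Once training is done, either encoder can be used alone to produce embeddings.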