I can understand that this works for cats or cars, but how does this work for images that are not in any training set yet? E.g. highly specialized images like x-ray pictures, or astronomy pictures?
As I understand it, the point is that these models while they are _trained_ on identifying cats or cars, because they have soon so much variation during training have internalised very different concepts to help come up with "its a cat". The idea then is to take all of these pre-trained weights that let you build this classifier, but then add your own custom head on the front of this network. This saves you doing a huge amount of training for what is essentially feature extraction - that part is already done. All you need to do is just add a bit more training that works out how to use these learnt features. I could be way off the mark, but that's how I understand it.
Yes, your understanding is correct. However, instead of adding a head on top of the network, most fine-tuning is currently done with LoRA (https://github.com/microsoft/LoRA). This introduces low-rank matrices between different layers of your models, those are then trained using your training data while the rest of the models' weights are frozen.
Foundational models are generally trained on internet scale level of data. They have seen billions of images, so they would have seen some medical images. For example, extracted from public datasets or textbooks. However, indeed, they may not be specialized to your use case. You could still fine-tune the model with a couple of examples to be more tailored to what you desire. Having a foundation model does not exclude training and your data could still be valuable. Indeed, you could achieve better performance by fine-tuning the larger model than just using your training data alone to train a model from scratch.
Also for the medical domain, I think vision-text segmentation models like SEEM (https://github.com/UX-Decoder/Segment-Everything-Everywhere-...) are really cool. You could for example ask “Where is the tumor located on that image?” and then the tumor is highlighted in the picture.