|
|
|
|
|
by foota
2 days ago
|
|
I find the way that models understand images to be seriously lacking. The root cause of the issue as I see it is that image encoding isn't contextual. The encoder should be aware of the prompt so that it can encode the right things. It seems like this should be something that could be trained into a model. |
|