|
|
|
|
|
by anon291
15 days ago
|
|
I mean I can't know for sure but I'm pretty sure that by the time the upper layers of the network are reached the lower level networks have already transformed the image tiles into proper position encoded embeddings of the tokens in the words in the image. That would be my operating assumption at least. |
|