by kolja005
969 days ago
I was a little confused about this too. The authors say in the paper: "The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens." I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595). The pooling layer takes the first token of the layer norm output and runs it through a dense layer. So it looks like in PaLI-3 the visual tokens are the hidden states that come out of the layer norm after the ViT encoder blocks, i.e. the full per-token sequence rather than the single pooled vector.

[1] https://github.com/huggingface/transformers/blob/main/src/tr...
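To make the "before pooling" vs. "after pooling" distinction concrete, here's a toy sketch in plain Python. It is not the HuggingFace or PaLI-3 code: the sizes, the random weights, and the helper names (`layer_norm`, `linear`, `random_matrix`) are all made up for illustration, the layer norm has no learned scale/shift, and the tanh after the pooling dense layer is my assumption about the pooler.

```python
import math
import random


def layer_norm(vec, eps=1e-12):
    """Normalize one token vector to zero mean, unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def linear(vec, weight, bias):
    """Dense layer: weight has one row per output feature."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_tokens, vit_dim, text_dim, num_text = 5, 8, 6, 3  # hypothetical toy sizes

# Stand-in for the output of the ViT encoder blocks: one vector per token.
encoder_out = random_matrix(rng, num_tokens, vit_dim)

# Final layer norm, applied per token. These are the "before pooling" states.
hidden_states = [layer_norm(tok) for tok in encoder_out]

# Pooling as described above: first token only, through a dense layer
# (tanh activation assumed here).
pool_w, pool_b = random_matrix(rng, vit_dim, vit_dim), [0.0] * vit_dim
pooled = [math.tanh(x) for x in linear(hidden_states[0], pool_w, pool_b)]

# The reading of the paper quote: skip the pooler, linearly project every
# token into the text embedding width, and prepend to the text embeddings.
proj_w, proj_b = random_matrix(rng, text_dim, vit_dim), [0.0] * text_dim
visual_tokens = [linear(tok, proj_w, proj_b) for tok in hidden_states]

text_embeddings = random_matrix(rng, num_text, text_dim)  # stand-in for embedded text
sequence = visual_tokens + text_embeddings  # visual tokens come first

print(len(hidden_states), len(pooled), len(sequence), len(sequence[0]))
```

The point is just the shapes: pooling collapses the image to one vector, while the pre-pooling path keeps one vector per token. In the HuggingFace model output these two correspond to `last_hidden_state` and `pooler_output`, if I'm reading it right.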