Hacker News new | ask | show | jobs
by sthatipamala 1022 days ago
Yes, it wouldn't know what to do with the picture unless you fine-tune the model (which is why they are permissively releasing it).

The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.

1 comments

Gotcha. Thanks for explaining