| I've only been reading ML stuff for a few months and I kind of understand what it's saying. This stuff isn't as complex as it's made out to be. It's just a bunch of black boxes, AKA "pure functions".
BLIP2's ViT-L+Q-former AKA //I give you a picture of a plate of lobster and it will say "A plate of lobster".
getTextFromImage(image) -> Text
Vicuna-13B AKA //I give you a prompt and it returns a completion, ChatGPT style
getCompletionFromPrompt(text) -> Text
We want to take the output of the first one and then feed a prompt to the LLM (Vicuna) that will help answer a question about the image. However, the data types don't match. Let's add in a mapper.
getAnswerToQuestion(image, question) -> answer
text = getTextFromImage(image)
prompt = mapTextToPrompt(text)
return getCompletionFromPrompt(prompt)
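To make the wiring concrete, here is a rough Python sketch of that text-based version. The function names mirror the pseudocode above, but every body is a made-up stand-in (no real BLIP2 or Vicuna is called), and passing the question into the mapper is my own assumption about how the prompt would be built.

```python
# Hypothetical stubs: real models would replace these bodies.

def get_text_from_image(image: str) -> str:
    # Stand-in captioner. In this sketch an "image" is just a label string.
    return f"A plate of {image}"


def map_text_to_prompt(caption: str, question: str) -> str:
    # Hand-written mapper (assumption): glue the caption and question together.
    return f"Image: {caption}\nQuestion: {question}\nAnswer:"


def get_completion_from_prompt(prompt: str) -> str:
    # Stand-in LLM: echoes the prompt so the data flow stays visible.
    return f"[completion for: {prompt}]"


def get_answer_to_question(image: str, question: str) -> str:
    caption = get_text_from_image(image)
    prompt = map_text_to_prompt(caption, question)
    return get_completion_from_prompt(prompt)


answer = get_answer_to_question("lobster", "What food is this?")
```

Here the mapper is written by hand; the point of the rest of the comment is that it can be learned instead.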
Now where did this mapTextToPrompt come from? This is the magic of ML. We can just "learn" this function from data. They plugged in a "simple" layer and learned it from a few examples of (image, question) -> answer. This is what frameworks like Keras and PyTorch allow you to do. You can wire up these black boxes with some intermediate layers, pass in a bunch of data, and voilà, you have a new model. This is called differentiable programming. The thing is, you don't need to convert to text and then map back into numbers to feed into the LLM. You skip that, take the numbers the image model outputs, and multiply them directly by an intermediate matrix.
getAnswerToQuestion(image, question) -> answer
imageEmbedding = getEmbeddingFromImage(image)
llmInputEmbedding = mapEmbeddingToInputEmbeddingForLLM(imageEmbedding)
return getCompletionForEmbedding(llmInputEmbedding)
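For a feel of what "learning the intermediate matrix" means, here is a toy sketch in plain Python. It is a big simplification and not how the real system is trained: I assume we are handed target embeddings directly, whereas in the setup described above the training signal would come from (image, question) -> answer examples, with a framework computing the gradients for you. The names `project`, `train_projection`, and `W_TRUE` are all invented for illustration.

```python
def project(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def train_projection(pairs, in_dim, out_dim, lr=0.1, epochs=200):
    """Fit W by per-example gradient descent on squared error."""
    W = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(epochs):
        for x, target in pairs:
            pred = project(W, x)
            for i in range(out_dim):
                err = pred[i] - target[i]
                for j in range(in_dim):
                    # d/dW[i][j] of (pred[i] - target[i])**2 is 2 * err * x[j]
                    W[i][j] -= lr * 2 * err * x[j]
    return W


# Toy data (assumption): pretend the "right" mapping is this fixed matrix.
W_TRUE = [[2.0, 0.0], [0.0, -1.0]]
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = [(x, project(W_TRUE, x)) for x in image_embeddings]

W_learned = train_projection(pairs, in_dim=2, out_dim=2)
llm_input_embedding = project(W_learned, [3.0, 4.0])
```

After training, `W_learned` ends up very close to `W_TRUE`, so `project` plays the role of mapEmbeddingToInputEmbeddingForLLM: one matrix multiply between the two black boxes.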
Congratulations, you now understand that sentence. |