by thewarrior 1157 days ago
I've only been reading ML stuff for a few months and I kind of understand what it's saying. This stuff isn't as complex as it's made out to be.

It's just a bunch of black boxes AKA "pure functions".

BLIP2's ViT-L+Q-former AKA

    // Give it a picture of a plate of lobster and it returns "A plate of lobster".

    getTextFromImage(image) -> Text

Vicuna-13B AKA

    // Give it a prompt and it returns a completion, ChatGPT style.
    getCompletionFromPrompt(text) -> Text

We want to take the output of the first one and then feed a prompt to the LLM (Vicuna) that will help answer a question about the image. However, the data types don't match. Let's add in a mapper.

    getAnswerToQuestion(image, question) -> answer 
        text = getTextFromImage(image)
        prompt = mapTextToPrompt(text)
        return getCompletionFromPrompt(prompt)
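
For anyone who wants to run the shape of this, here is a minimal Python sketch. The two models are passed in as plain callables, and `map_text_to_prompt` is a hand-written template that I made up for illustration. The pseudocode doesn't show where the question enters, so this sketch assumes it goes into the mapper.

```python
from typing import Callable


def map_text_to_prompt(caption: str, question: str) -> str:
    """Hypothetical hand-written mapper: wrap the caption and question in a template."""
    return f"Image description: {caption}\nQuestion: {question}\nAnswer:"


def get_answer_to_question(
    image: object,
    question: str,
    get_text_from_image: Callable[[object], str],
    get_completion_from_prompt: Callable[[str], str],
) -> str:
    """Chain the two black boxes: image -> caption -> prompt -> completion."""
    caption = get_text_from_image(image)
    prompt = map_text_to_prompt(caption, question)
    return get_completion_from_prompt(prompt)


# Fake stand-ins so the pipeline runs without any real model.
fake_captioner = lambda image: "A plate of lobster"
fake_llm = lambda prompt: f"(completion for: {prompt!r})"

print(get_answer_to_question(None, "What food is this?", fake_captioner, fake_llm))
```

Swap the fakes for real model calls and the wiring stays the same.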

Now where did this mapTextToPrompt come from?

This is the magic of ML. We can just "learn" this function from data. They plugged in a "simple" layer and learned it from a few examples of (image, question) -> answer. This is what frameworks like Keras and PyTorch allow you to do: wire up these black boxes with some intermediate layers, pass in a bunch of data, and voila, you have a new model. This is called differentiable programming.
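
To make "learn the function from data" concrete, here is a toy sketch in plain Python. Everything in it is a made-up stand-in: the frozen "encoder" and "decoder" are one-line functions rather than real models, and the mapper is a single `w * h + b`. The gradient is worked out by hand with the chain rule, which is the part a framework would do for you automatically.

```python
def frozen_encoder(x: float) -> float:
    """Stand-in for the pretrained image model; it is never updated."""
    return 2.0 * x + 1.0


def frozen_decoder(z: float) -> float:
    """Stand-in for the pretrained LLM; it is never updated."""
    return 3.0 * z


def train_mapper(examples, lr=0.005, epochs=500):
    """Learn w, b so that frozen_decoder(w * frozen_encoder(x) + b) matches y."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            h = frozen_encoder(x)      # output of the first black box
            z = w * h + b              # the only trainable piece
            pred = frozen_decoder(z)   # output of the second black box
            err = pred - y
            # Loss is err**2. Chain rule back through the frozen decoder,
            # whose derivative with respect to z is 3.
            grad_z = 2.0 * err * 3.0
            w -= lr * grad_z * h
            b -= lr * grad_z
    return w, b


# Training data generated from a hidden "true" mapper with w=0.5, b=-1.0.
examples = [
    (x, frozen_decoder(0.5 * frozen_encoder(x) - 1.0))
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0)
]
w, b = train_mapper(examples)
print(round(w, 3), round(b, 3))  # should end up close to 0.5 and -1.0
```

Only `w` and `b` change during training. The two black boxes on either side stay exactly as they were.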

The thing is, you don't need to convert to text and then map back into numbers to feed into the LLM. You skip that step, take the numbers the image model outputs, and multiply them directly by an intermediate matrix.

    getAnswerToQuestion(image, question) -> answer
        imageEmbedding = getEmbeddingFromImage(image)
        embedding = mapEmbeddingToInputEmbeddingForLLM(imageEmbedding)
        return getCompletionForEmbedding(embedding)
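
The "multiply by a matrix" step can be sketched in a few lines of plain Python. The sizes and numbers here are invented and tiny (a 3-number image embedding mapped into a 2-number LLM input space), and putting the projected image vector in front of the question's token embeddings is my assumption about how the two get combined.

```python
def project(image_embedding, matrix):
    """Multiply a vector by a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, image_embedding)) for row in matrix]


def build_llm_input(image_embedding, matrix, question_embeddings):
    """Put the projected image vector in front of the question's token vectors."""
    return [project(image_embedding, matrix)] + question_embeddings


image_embedding = [1.0, 2.0, 3.0]             # from the image model (3 numbers)
matrix = [[1.0, 0.0, 0.0],                    # the learned mapper: 2 rows x 3 columns
          [0.0, 1.0, 1.0]]
question_embeddings = [[0.5, 0.5], [0.1, 0.2]]  # two question tokens, 2 numbers each

llm_input = build_llm_input(image_embedding, matrix, question_embeddings)
print(llm_input)  # [[1.0, 5.0], [0.5, 0.5], [0.1, 0.2]]
```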

Congratulations, you now understand that sentence.
3 comments

Interesting, so the LLM is "just" getting your question plus a normal text description of the image (as vectors)?
At a high level, yes.

More precisely, it gets the question plus the description of the image after that description has been passed through a matrix that transforms it so the LLM can “understand” it.

It maps from the space of one ML model to the other.

This feels like such an accessible explanation.
Thank you for the insightful breakdown. Cheers!