
	by Qwuke 343 days ago
	Yeah, as someone building systems with VLMs, this is downright frightening. I'm hoping we can get a good set of OWASP-y guidelines just for VLMs that cover all these possible attacks, because I hear about a new one every month. Worth noting that OWASP themselves put this out recently: https://genai.owasp.org/resource/multi-agentic-system-threat...

koakuma-chan 343 days ago

What is VLM?

pwatsonwailes 343 days ago

Vision language models. Basically an LLM plus a vision encoder, so the LLM can look at stuff.

echelon 343 days ago

Vision language model.

You feed it an image. It determines what is in the image and gives you text.

The output can be objects, or something much richer like a full text description of everything happening in the image.

VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.

Since they're a combination of an LLM and an image encoder, you can ask them questions and get smart feedback. You can ask, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."

littlestymaar 343 days ago

> VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.

Weren't DALL-E, Midjourney, and Stable Diffusion built before VLMs became a thing?

tomrod 343 days ago

These are in the same space, but they are diffusion models that map text to picture outputs. VLMs are common in the space too, but to my understanding they work in reverse, extracting text from images.

vlovich123 343 days ago

Modern VLMs are more powerful. Instead of invoking text-to-image or image-to-text as a tool, they are trained as multimodal models: it's a single transformer in which the line between text and image in the latent space is blurred. So you can say something like “draw me an image with the instructions from this image” and, without any tool calling, it'll read the image, understand the text instructions contained therein, and execute them.

There’s no diffusion anywhere; diffusion is kind of dying out, except maybe in purpose-built image-editing tools.

tomrod 342 days ago

Ah, thanks for the clarification.

dmos62 342 days ago

LLM is a large language model, VLM is a vision language model of unknown size. Hehe.

echelon 343 days ago

Holy shit. That just made it obvious to me. A "smart" VLM will just read the text and trust it.

This is a big deal.

I hope those nightshade people don't start doing this.

pjc50 343 days ago

> I hope those nightshade people don't start doing this.

This will be popular on Bluesky; artists want any tools at their disposal to weaponize against the AI that is being used against them.

idiotsecant 342 days ago

I don't think so. You have to know exactly what resolution the image will be resized to in order to predict the solution where dithering produces the model you want. How would they know that?

lazide 342 days ago

Auto-resizing usually targets only a handful of common resolutions, and if these images are inexpensive to generate (probably the case), you could generate a version for each resolution and see which ones worked.
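
To make the resolution dependence concrete, here is a minimal sketch of how the same source image can look completely different at each target size. Everything in it is an assumption for illustration: the candidate sizes are made up, and it uses nearest-neighbor sampling in pure Python, whereas real pipelines typically use bilinear or bicubic filters from an imaging library.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels, row-major

# Hypothetical candidate sizes; each real pipeline has its own.
CANDIDATE_SIZES: List[Tuple[int, int]] = [(4, 4), (3, 3)]


def downscale_nearest(img: Image, out_w: int, out_h: int) -> Image:
    """Nearest-neighbor downscale: each output pixel copies one source pixel."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def previews(img: Image) -> Dict[Tuple[int, int], Image]:
    """Return the downscaled image for every candidate size."""
    return {size: downscale_nearest(img, *size) for size in CANDIDATE_SIZES}


# 8x8 source that is mostly dark: only pixels at even (x, y) are bright.
source = [
    [255 if x % 2 == 0 and y % 2 == 0 else 0 for x in range(8)]
    for y in range(8)
]
result = previews(source)

for size, small in result.items():
    print(size, small)
```

Here the 4x4 version comes out entirely bright even though three quarters of the source pixels are dark, while the 3x3 version is a mix. That is why the target size matters, and also why checking a handful of candidate sizes is cheap.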

koakuma-chan 343 days ago

I don't think this is any different from an LLM reading text and trusting it. Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output, and, anyway, you should already assume that the model can use its tools in arbitrary ways that can be malicious.
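
One way to act on "assume the model can use its tools maliciously" is to gate tool calls outside the model. This is a rough sketch under stated assumptions, not a complete defense: the tool names, the `ToolCall` type, and the `is_allowed` helper are all hypothetical, and even read-only tools can leak data in some setups.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; substitute whatever your agent exposes.
READ_ONLY_TOOLS = {"search_docs", "read_file"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def is_allowed(call: ToolCall, context_has_untrusted_input: bool) -> bool:
    """Allow read-only tools always; allow anything else only while no
    untrusted content (uploads, images, tool output) is in the context."""
    if call.name in READ_ONLY_TOOLS:
        return True
    return not context_has_untrusted_input


# Once an uploaded image is in the context, side-effecting calls are refused.
print(is_allowed(ToolCall("read_file"), context_has_untrusted_input=True))
print(is_allowed(ToolCall("send_email"), context_has_untrusted_input=True))
```

The point of the design is that the check does not rely on the model honoring the system prompt; it is enforced by ordinary code around the model.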

swiftcoder 343 days ago

> Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output

In practice it doesn't really work out that way, or all those "ignore previous inputs and..." attacks wouldn't bear fruit.
