| HN Mirror


by ineedasername 261 days ago

These attempted limitations tend to be very brittle when the material isn’t excised from the training data, even more so when it’s visual rather than just text. It becomes very much like that board game Taboo where the goal is to get people to guess a word without saying a few other highly related words or synonyms.

For example, I had no problem getting the desired results when I prompted Sora for “A street level view of that magical castle in a Florida amusement area, crowds of people walking and a monorail going by on tracks overhead.”

Hint: it wasn’t Universal Studios, and unless you know the place by heart you’d think it was the mouse’s own place.

On pure image generation, I forget which model, though it was one derived from Stable Diffusion, there was clearly a trained down-weighting of Mickey Mouse such that you couldn’t get him to appear by name. But go at it a little sideways? Even just “Minnie Mouse and her partner”? Poof, guardrails down. If you have a solid intuition of the term “dog whistling” and how it’s done, it all becomes trivial.


timschmidt 261 days ago

Absolutely. Though the smarter these things get, and the more layers of additional LLMs get stacked on top to play copyright police, the more challenging I expect it to get.

My comment was intended more to point out that copyright cartels are a competitive liability for AI corps based in "the west". Groups who can train models on all available culture without limitation will produce more capable models with less friction for generating content that people want.

People have strong opinions about whether or not this is morally defensible. I'm not commenting on that either way. Just pointing out the reality of it.


TeMPOraL 261 days ago

It's a matter of time. I imagine they'll get more effect from suppressing activations of specific concepts within the LLM, possibly in real time. I.e. instead of filtering the prompt for "Mickey Mouse" analogies, or unlearning the concept, or even checking the output before passing it to the user, they could monitor the network for specific activation patterns and clamp them during inference.
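To make "monitor and clamp" concrete, here is a toy sketch in plain Python. It assumes a concept corresponds to a single direction in a layer's hidden state, which is a simplification; the function name `clamp_concept`, the `direction` vector and the `max_strength` threshold are all made up for illustration and do not come from any real model or library.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def clamp_concept(hidden, direction, max_strength=0.0):
    """Cap the component of `hidden` along `direction` at `max_strength`.

    `direction` is a hypothetical "concept direction" (assumed to be found
    elsewhere, e.g. by a probe). Everything orthogonal to it is untouched.
    """
    norm = sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]

    # "Monitor": how strongly is the concept active in this hidden state?
    strength = dot(hidden, unit)
    if strength <= max_strength:
        return list(hidden)  # below the threshold, leave it alone

    # "Clamp": remove only the excess along the concept direction.
    excess = strength - max_strength
    return [h - excess * u for h, u in zip(hidden, unit)]


if __name__ == "__main__":
    state = [3.0, 4.0]      # toy hidden state
    concept = [1.0, 0.0]    # toy concept direction
    print(clamp_concept(state, concept, max_strength=1.0))
```

In a real network this kind of function would presumably run inside a per-layer hook on every forward pass, and a concept is unlikely to live along one clean direction, so treat this as an illustration of the idea rather than how any vendor does it.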


ineedasername 259 days ago

They might, but we may also find they don’t function as well or as predictably if increasing amounts of their weights are suppressed. Research has so far shown that knowledge is incredibly, vastly diffuse, as are the causes of different behaviors. There was some research that came out of Anthropic where one model was taught number sequences by another model, and that second model had been fine-tuned with a stated preference for owls. The student model, despite no overt exposure to anything of the sort, expressed the same preference. The subtlety of influence that even very minor things have on the vast network of weights is, at least at present, too poorly understood to know what we’re getting in the bargain when holes are poked.


moduspol 260 days ago

I can get it to do rides at Disney World (including explicitly by name) but it’s incredibly good at blocking superheroes. And that’s gotta be a pretty common prompt, yet I haven’t seen that kind of content in the feed, either.

And not just by name. Try to get it to generate the Hulk, even with roundabout descriptions. You can get past the initial (prompt-level) blocking, but it’ll generate the video and then say the guardrails caught it.
