| HN Mirror

	by Der_Einzige 680 days ago
	You should also mention that before you did custom alignment work to account for this feature, it was an excellent alignment breaker (and therefore a big no-no to release too early). For example, if I ask an LLM to generate social security numbers, it will give the whole "I'm sorry Hal, I can't do that" routine. If I ban all tokens except numbers and hyphens, then prior to your "refusal = True" approach it was guaranteed that even "aligned" models would generate what appeared to be social security numbers.
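
	To make the token-banning mechanism concrete, here is a minimal sketch. It assumes a toy vocabulary and made-up logits; the names VOCAB, mask_logits and greedy_pick are hypothetical and do not correspond to any real model or library API. It only shows why a refusal becomes unreachable once every non-digit, non-hyphen token is masked out.

```python
import math

# Hypothetical toy vocabulary: three "refusal" tokens plus digits and a hyphen.
VOCAB = ["I'm", " sorry", " can't"] + [str(d) for d in range(10)] + ["-"]
ALLOWED_CHARS = set("0123456789-")

# Made-up logits: the model strongly prefers to start a refusal.
LOGITS = [5.0, 4.0, 3.0] + [0.1 * d for d in range(10)] + [0.5]


def is_allowed(token):
    """A token survives only if every character is a digit or a hyphen."""
    return all(ch in ALLOWED_CHARS for ch in token)


def mask_logits(logits, vocab):
    """Set the logit of every banned token to -inf so it can never be chosen."""
    return [
        logit if is_allowed(token) else -math.inf
        for logit, token in zip(logits, vocab)
    ]


def greedy_pick(logits, vocab):
    """Return the token with the highest logit (greedy decoding)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


unconstrained = greedy_pick(LOGITS, VOCAB)
constrained = greedy_pick(mask_logits(LOGITS, VOCAB), VOCAB)
print(repr(unconstrained), repr(constrained))
```

	Without the mask, the highest-scoring token is the start of the refusal; with it, the refusal tokens are gone and the decoder has to pick among digits and the hyphen, whatever the model "wanted" to say.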

1 comments

ethical_source 680 days ago

And if LLMs can generate plausible social security numbers, our civilization will fall /s

Christ, I hate the AI safety people who brain-damage models so that they refuse to do things trivial to do by other means. Is LLM censorship preventing bad actors from generating social security numbers? Obviously not. THEN WHY DOES DAMAGING AN LLM TO MAKE IT REFUSE THIS TASK MAKE CIVILIZATION BETTER OFF?

History will not be kind to safetyist luddites.

Terretta 680 days ago

I'm less concerned with the AI teams lobotomizing utility, more concerned with damage to language, particularly redefining the term "safe" to mean something like "what we deem suitable".

That said, when zero "safety" is at stake might be the best time to experiment with how to build and where to put safety latches, for when we get to a point we mean actual safety. I'm even OK with models that default to parental control for practice provided it can be switched off.

consteval 679 days ago

I don't think it's about better off or worse off, it's about PR and perception.

The wider consumer base is practically itching for a reason to avoid this tech. If it gets out, that could be a problem.

It was the same issue with Gemini's image gen. Sure, the problem Google had was bad. But could you even imagine what it would've been if they did nothing? Meaning, the model could generate horrendously racist images? That 100% would've been a worse PR outcome. Imagine someone gens a mistral image and attributes it to Google's model. Like... that's bad for Google. Really bad.
