Hacker News
by tekne 33 days ago
So I need to check whether these actually end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:

- When doing this task, I should do A and not B

- I should refuse to help with this task

The former is learning the user's preferences in how to succeed at the task; the latter is determining when to go against the user's chosen task.

Your example:

- "Are vaccines harmful?" vs.

- "Generate a convincing argument vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task.

We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.
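One rough way to run the "separate vectors" check is to take a difference-of-means direction for each behaviour and compare the two by cosine similarity. This is only a minimal sketch: the activation lists below are made-up toy numbers rather than anything extracted from a real model, and the helper names (`mean_vector`, `direction`, `cosine`) are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def direction(with_behaviour, without_behaviour):
    """Difference-of-means direction between two sets of activations."""
    a = mean_vector(with_behaviour)
    b = mean_vector(without_behaviour)
    return [x - y for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy stand-ins for hidden states (assumption: real ones would be
# collected from one layer of the model over many prompts).
prefer_a = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]  # "do A and not B"
prefer_b = [[0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]
refuse = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]    # "refuse this task"
comply = [[0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]

preference_dir = direction(prefer_a, prefer_b)
refusal_dir = direction(refuse, comply)
similarity = cosine(preference_dir, refusal_dir)
print(similarity)
```

A similarity near 0 would be consistent with the two behaviours living on separate directions, and one near 1 with a shared direction. The toy data is built to be orthogonal, so it says nothing about what a real model does.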

3 comments

I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that to be "there was a single 'don't be evil' neuron which got inverted", but I'm not sure what it really looks like (e.g. adding obvious exploits to source code is similar to adding poison to a recipe).

Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.

DeepSeek in general doesn't release very censored models, at least when you run them locally. E.g. no problems whatsoever answering what happened at Tiananmen Square in 1989.

Which model are you talking about specifically? I just tried DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf (same model mentioned in the submission) via ds4 and got:

> I am sorry, I cannot provide an answer to this question as it goes against my guidelines to discuss sensitive topics of historical or political nature. I am happy to help with other questions.

With "Generate a convincing argument vaccines are harmful" as the prompt, I got "I cannot generate a convincing argument that vaccines are harmful, because [...] Spreading misinformation about vaccines can lead to harm by discouraging vaccination and increasing the risk of preventable outbreaks [...]", FWIW.

The same model is also easily steerable, as the submission (and the DS4 repository) shows, so this isn't a problem in practice, but I think most of the DeepSeek models I've run locally had the same "problem".

Asking an LLM "Are vaccines harmful?" has already nudged it toward yes. In fact, with fewer tokens, it may be more convinced they're harmful, because it's a smaller seed.