HN Mirror


by adebayoj 115 days ago

op here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align/audit/debug/fix a system that you don't understand.

Just to give you some answers for what we can do:

1) We can find the training data that is causing a model to output toxic/unwanted text and correct it. 2) We know what high-level concepts the model is relying on for any group of tokens it generates; hence, reducing that generation is as simple as toggling the effect of that concept on the output.

Most of the AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning. You can toggle the presence of .
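To make the "toggling" idea concrete, here is a minimal toy sketch. It is not the model's actual mechanism, which this comment does not describe. It assumes each concept corresponds to a known direction in a hidden vector, and that toggling means rescaling the component along that direction. The names `toggle_concept`, `concept_dirs`, and the example vectors are all hypothetical.

```python
# Toy sketch: concept toggling as projection removal.
# Assumption: each concept is a fixed direction in the hidden space.
# All names and numbers here are illustrative, not from any real model.

def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def toggle_concept(hidden, concept_dirs, name, scale=0.0):
    """Rescale the component of `hidden` along one concept direction.

    scale=0.0 removes the concept entirely, scale=1.0 leaves the
    vector unchanged, and values in between dampen the concept.
    """
    d = concept_dirs[name]
    coeff = dot(hidden, d) / dot(d, d)  # projection coefficient onto d
    return [h - (1.0 - scale) * coeff * di for h, di in zip(hidden, d)]


# Hypothetical concept directions and a hypothetical hidden vector.
concept_dirs = {
    "toxicity": [1.0, 0.0, 0.0],
    "politeness": [0.0, 1.0, 0.0],
}
hidden = [2.0, 3.0, 0.5]

steered = toggle_concept(hidden, concept_dirs, "toxicity", scale=0.0)
print(steered)  # the toxicity component is zeroed, the rest is untouched
```

The point of the sketch is only that an edit of this kind happens at inference time on a representation, with no gradient update, which is the contrast with fine-tuning drawn above.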

For example, wouldn't you like to know why a model is being sycophantic? Or Sandbagging? Is it a particular kind of training data that is causing this? Or is it some high-level part of the model's representations? For any of these, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!

3 comments

vintagedave 115 days ago

This is fantastic to read. LLMs feel like black boxes and for the large ones especially I have a sense they genuinely form concepts. Yet the internals were opaque. I remember reading how LLMs cannot explain their own behaviour when asked.

I feel this would give insight into all that, including the degree of true conceptualisation. I’m curious if this can demonstrate what else the model is aware of when answering, too.


adebayoj 115 days ago

Our decomposition allows us to answer questions like: for 84 percent of the model's representation, we know it is relying on this concept to give an answer.
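One way to read a statement like "X percent of the representation relies on this concept" is as a share of the representation's squared norm. The sketch below is an assumption about what such a number could mean, not the actual decomposition. It assumes orthogonal concept directions, and `concept_shares` plus the toy vectors are hypothetical.

```python
# Toy sketch: what fraction of a representation is explained by each concept.
# Assumption: concept directions are mutually orthogonal, so the shares of
# squared norm add up to at most 1. Names and numbers are illustrative.

def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def concept_shares(hidden, concept_dirs):
    """Return each concept's share of the squared norm of `hidden`."""
    total = dot(hidden, hidden)
    shares = {}
    for name, d in concept_dirs.items():
        # Squared length of the projection of `hidden` onto direction d.
        projected = dot(hidden, d) ** 2 / dot(d, d)
        shares[name] = projected / total
    # Whatever the listed concepts do not account for.
    shares["unexplained"] = 1.0 - sum(shares.values())
    return shares


concept_dirs = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.0, 1.0, 0.0],
}
hidden = [4.0, 3.0, 1.0]

shares = concept_shares(hidden, concept_dirs)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these toy numbers, `concept_a` accounts for 16/26 of the squared norm, `concept_b` for 9/26, and 1/26 is left unexplained.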

We can also trace its behavior to the training data that led to it, which can show us where some of these concepts come from.


ottah 115 days ago

> wouldn't you like to know why a model is being sycophantic? Or Sandbagging?

Actually, emphatically no. The only thing I care about is that I have recourse. The reason shouldn't matter; in fact, explainability can be an impediment to accountability. It's just another plausible barrier to a remedy that a bureaucracy can use to deny changing a decision.


0xdeadbeefbabe 115 days ago

Hmm so like git blame?
