by adebayoj
115 days ago
OP here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align, audit, debug, or fix a system that you don't understand. Just to give you some examples of what we can do:
1) We can find the training data that is causing a model to output toxic/unwanted text and correct it.
2) We know what high-level concepts the model is relying on for any group of tokens it generates; hence, reducing that generation is as simple as toggling the effect of that concept on the output.
Most AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning. You can toggle the presence of . For example, wouldn't you like to know why a model is being sycophantic? Or sandbagging? Is it a particular kind of training data that is causing this? Or is it some high-level part of the model's representations? For any of these, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
|
I feel this would give insight into all of that, including the degree of true conceptualisation. I’m also curious whether this can show what else the model is aware of when answering.