by dartos
465 days ago
> We don't know what actually happens in the model at time of inference.

How could we not know? Every processor instruction is observable. What we specifically don't have a good view of is the causal relationship between input tokens, a model's weights, and the output. We don't know specifically which weights matter or why. That's very different from not understanding what processes are taking place.
We only know how the structures are designed to work, and we have hypotheses about how they likely work. We can't interpret what actually happens while the LLM is generating a response.
That seems pedantic or unimportant on the surface, but there are some really important implications. At the more benign end, we don't know why a model gave a bad response when a person wasn't happy with the output. At the more serious end, any risk of these models becoming self-directed or malicious simply can't be recognized or guarded against. We won't know that a model has become self-directed until after it acts in ways that don't match how we expect it to work.
Both alignment and interpretability were important research topics for decades of AI research. We effectively abandoned them once we made real technological advances: once an AI-like tool was no longer entirely theoretical, we couldn't be bothered to focus resources on figuring out how to do it safely. The horse was already out of the barn.
Does this mean the models will turn evil or that things will end up going poorly for us? Absolutely not. It just means that we have to cross our fingers and hope, because we can't detect issues early.
[1] https://arxiv.org/abs/2309.01029