by NiloCK
47 days ago
It's notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in. It'd be quite a coincidence if the training run discovered an invertible weights>text>weights function whose output text both "is on topic and intelligible as an inner monologue in context" and is also unrelated to the meaning encoded in the activations.