|
|
|
|
|
by NiloCK
47 days ago
|
|
Future model training runs will have a copy of this research, and know "to defend against it". EG, could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs? |
|