by NiloCK
46 days ago
I've already posted a couple of times here, but I'm pretty jazzed about this publication. Some thoughts:

1. It's amazing how obvious this research looks in hindsight. LLMs have been (rightly) characterized as inscrutable black boxes. If only there were some discipline for learning and extracting semantics from information-dense payloads ... !?

2. NLAs seem to be in the ballpark of a safety and interpretability standard that is both enforceable (easy?) and plausibly effective (probably hard to prove definitively, but easy to believe at least partially).

3. NLAs here are trained against the residual stream of a model at some layer (N). It would be interesting to see a sequence of NLAs trained against a staggered set of layers. There may be a semantically meaningful evolution of 'thought' going from the early to the late layers.

4. I would love to see this technique applied to tokens across the boundaries of model 'aha!' moments (to what extent is the 'aha' an affectation, or is there actually a sharp turn in the model's understanding?), and across jailbreaks / personality snaps [1].

[1] - https://gemini.google.com/share/6d141b742a13
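
To make point 3 a bit more concrete, here is a minimal sketch of just the data-collection side: tapping the residual stream at a staggered set of layers. Everything in it is an assumption for illustration. The "model" is a toy stack of random residual layers, not a real LLM, the helper names (`make_layer`, `collect_residuals`, `tap_every`) are hypothetical, and no NLA is trained. In a real setup you would presumably train one NLA per tapped layer and compare their outputs; the cosine similarity below is only a crude stand-in for how much the stream changes between taps.

```python
import math
import random

def make_layer(dim, rng):
    # Hypothetical toy "layer": a random dim x dim weight matrix.
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

def apply_layer(weights, x):
    # Residual update: each layer adds tanh(W x) to the running stream.
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]

def collect_residuals(layers, x, tap_every):
    """Run x through all layers, saving a copy of the residual stream
    after every `tap_every`-th layer (1-indexed)."""
    taps = {}
    for i, weights in enumerate(layers, start=1):
        x = apply_layer(weights, x)
        if i % tap_every == 0:
            taps[i] = list(x)
    return taps

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
dim, depth = 16, 12
layers = [make_layer(dim, rng) for _ in range(depth)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Tap layers 3, 6, 9, 12: these are the points where per-layer NLAs would attach.
taps = collect_residuals(layers, x0, tap_every=3)

# Similarity between consecutive taps, as a rough proxy for layer-to-layer drift.
tapped = sorted(taps)
drift = {(a, b): cosine(taps[a], taps[b]) for a, b in zip(tapped, tapped[1:])}
for pair, sim in drift.items():
    print(pair, round(sim, 3))
```

With a real model, the tap would be a forward hook on each chosen layer rather than a hand-rolled loop, but the shape of the experiment is the same: one saved activation per staggered layer, then one interpretation per activation.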