by TastyDucks 753 days ago

This sort of anthropomorphic, "incantation"-style prompting is a workaround while mechanistic interpretability and monosemanticity work[1] is done to expose the features (or neurons) that have the largest impact on model behavior -- cf. Golden Gate Claude.

Further, even if end users can steer model behavior only through token input, we can likely reverse-engineer optimal inputs that drive desired behaviors; convergent internal representations[2] mean this research might transfer across models as well (particularly Gemma -> Gemini, as I believe they share the same architecture and training data).

I suspect we'll see understandable super-human prompting (and higher-level control) emerge from GAN and interpretability work within the next few years.

[1]: https://transformer-circuits.pub/2024/scaling-monosemanticit...

[2]: https://arxiv.org/abs/2405.07987