by ashdksnndck
369 days ago
Existing research already shows that LLMs have neurons or linear representations that model the gender [1], personality [2], ideology [3], and historical era [4] of the author. There’s also evidence that they model the distinction between the beliefs of the author and those of other characters, which has been summarized as “theory of mind” [5]. And we have only scratched the surface: most of this research uses small open-weight models that lag behind frontier model capabilities.

[1] Z. Yu & S. Ananiadou, “Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing,” arXiv:2501.14457 (2025).

[2] J. Deng et al., “Neuron-based Personality Trait Induction in Large Language Models,” arXiv:2410.12327 (2024).

[3] J. Kim, J. Evans & A. Schein, “Linear Representations of Political Perspective Emerge in Large Language Models,” arXiv:2503.02080 (2025).

[4] W. Gurnee & M. Tegmark, “Language Models Represent Space and Time,” arXiv:2310.02207 (2023).

[5] C. Hardy, “A Sparse ToM Circuit in Gemma-2-2B,” https://xtian.ai/pages/document.pdf