| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by energy123 344 days ago
	Mechanistic interpretability could play a role here. The sycophancy you describe in chat mode could be when the question is "too difficult" and the AI defaults to easy circuits that rely on simple rule of thumbs (like does the context contain positive words such as "excellent"). The user experiences this as the AI just following basic nudges. Could real-time observability into the network's internals somehow feed back into the model to reduce these hallucination-inducing shortcuts? Like train the system to detect when a shortcut is being used, then do something about it?