


	by NiloCK 47 days ago
	It's notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in. It would be quite a coincidence if the training runs discovered an invertible weights>text>weights function whose text is both "on topic and intelligible as an inner monologue in context" and unrelated to the meaning encoded in the activations.

1 comment

kraddypatties 46 days ago

I think the only thing that gives me pause is the fact that they SFT on Opus 4.5 explanations as a pretraining step. But generally I agree, especially given that the autoencoder only sees a single token's activation!
