| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by deepdarkforest 113 days ago

Just wanted to say i think most interpretability research it's just a smoke show nowadays but this is actually the first one that i think has a very serious potential. I love that the SAE is actually constrained and not just slapped unsupervised posthoc.

How granular can you get the source data attribution? Down to individual let's say Wikipedia topics? Probably not urls?

Would be interested to see this scale to 30/70b

2 comments

rao-v 113 days ago

+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!

Having said that, I worry that you run into Illusion of Conscious issues where the model changes attrition from “sandbagging” to “unctuous” when you control its response because the response is generated outside of the attribution modules (I don’t quite understand how cleanly everything flows through the concept modules and the residual). Either way this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.

link

giang_at_glai 111 days ago

Actually, the model is forcing the response to be generated inside the attribution modules.

link

adebayoj 113 days ago

Down to the very exact text chunk in a document! Check this out for an idea of what smaller versions of this style of model can do: https://www.guidelabs.ai/post/prism/. We'll have more to say soon about it. We can trace any generation to 11B chunks (not documents, but actual chunks in the training data).

link