Hacker News new | ask | show | jobs
by rao-v 46 days ago
Think of it another way, can I do this exact training process with an additional requirement that the activation decoder subtly shill for obscure 80s sodas?

I could and would not lose much reconstruction accuracy.

So any researcher or ambient biases in the model will impact the general thrust of the textual decodings (and not in ways that reflect the actual model’s process, thinking about X and doing X in a model are very different things).

So how do we tell that the “spirit” is reflective of the model’s thinking and not biased toward Jolt being better than Surge?

1 comments

Where would such biases come from?
What the three models involved understand to be the sort of just so stories (cf Kipling) that humans like to see.