by HarHarVeryFunny
1114 days ago
I'm not sure how much is actually known to write about, but what I'd like to see explained is how transformer-based LLMs/AI really work: not at the mechanistic level of the architecture, but in terms of what they learn (some type of world model? details, not hand waving!) and how they utilize this when processing various types of input. What type of representations are being used internally in these models? We've got token embeddings going in, and it seems like some type of semantic embeddings internally perhaps, but exactly what? OTOH it's outputting words (tokens) with only a linear layer between the last transformer block and the softmax, so what does that say about the representations at that last transformer block?
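To make the last point concrete, here is a toy sketch of that final readout step. Everything in it is made up for illustration (the three-word vocabulary, the 2-dimensional hidden state, the `W_U` matrix and the `unembed` helper are not from any real model). It only shows that with a single linear layer before the softmax, each token's logit is a dot product between the final hidden state and that token's output vector, so whatever the last block produces has to be linearly decodable into next-token scores.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(hidden, W_U):
    # One row of W_U per vocabulary token; each logit is a plain dot product.
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_U]

# Hypothetical toy values, chosen by hand.
vocab = ["Paris", "Rome", "banana"]
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [2.0, 0.5]  # stand-in for the last block's output at one position

probs = softmax(unembed(hidden, W_U))
best = vocab[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Real models typically apply a final layer norm before this step and use thousands of dimensions, but the readout itself has the same shape.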
One of the most interesting presentations in the last session of the workshop was this talk by David Bau titled "Direct Model Editing and Mechanistic Interpretability". David and his team locate where specific facts are stored in the model and edit them. For example, they edit the location of the Eiffel Tower to be Rome, so whenever the model generates anything that depends on that location (e.g., the view from the top of the tower), it actually describes Rome.
Talk: https://www.youtube.com/watch?v=I1ELSZNFeHc
Paper: https://rome.baulab.info/
Follow-up work: https://memit.baulab.info/
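For intuition about what "editing a fact" can mean, here is a heavily simplified sketch of a rank-one weight update. It is not the paper's implementation: as I understand it, ROME first locates the relevant MLP layer and uses a covariance-weighted update, whereas this toy assumes an identity covariance and hand-picked 2-dimensional vectors. The names `rank_one_edit`, `k_eiffel`, and `v_rome` are hypothetical.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, k, v):
    # Return W' = W + (v - W k) k^T / (k^T k), so that W' k == v.
    # Simplifying assumption: identity key covariance.
    Wk = matvec(W, k)
    kk = sum(ki * ki for ki in k)
    residual = [vi - wki for vi, wki in zip(v, Wk)]
    return [[w + r * kj / kk for w, kj in zip(row, k)]
            for row, r in zip(W, residual)]

# Toy "memory": W maps a subject key to a value vector.
W = [[1.0, 0.0], [0.0, 1.0]]
k_eiffel = [1.0, 0.0]   # hypothetical key for "Eiffel Tower"
k_other = [0.0, 1.0]    # an unrelated key, orthogonal to k_eiffel
v_rome = [0.0, 1.0]     # hypothetical value meaning "is in Rome"

W_edited = rank_one_edit(W, k_eiffel, v_rome)
print(matvec(W_edited, k_eiffel))  # the edited key now maps to v_rome
print(matvec(W_edited, k_other))   # orthogonal keys are unaffected
```

The appeal of a rank-one update is that it changes the output for one key direction while leaving directions orthogonal to it alone.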
There is also work on "probing" the representation vectors inside the model and investigating what information is encoded at the various layers. One early Transformer explainability paper, "BERT Rediscovers the Classical NLP Pipeline" (https://arxiv.org/abs/1905.05950), found that "the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way: POS tagging, parsing, NER, semantic roles, then coreference". In other words, the representations in the earlier layers encode things like whether a token is a verb or a noun, and later layers encode other, higher-level information. I've made an intro to these probing methods here: https://www.youtube.com/watch?v=HJn-OTNLnoE
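A probe in this sense is usually just a small classifier trained on frozen hidden states. Below is a minimal sketch using logistic regression on synthetic 2-dimensional "hidden states" where, by construction, the first dimension separates nouns from verbs. The data and the `train_probe` helper are invented for illustration; a real probe would use activations extracted from a chosen layer and report accuracy on held-out tokens.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(states, labels, lr=0.5, epochs=200):
    # Batch gradient descent for logistic regression; the model itself is never updated.
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic stand-ins for layer activations: label 1 = noun, 0 = verb.
states = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.4],
          [-1.0, 0.3], [-0.8, -0.2], [-1.1, 0.1]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_probe(states, labels)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in states]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print("probe accuracy:", accuracy)
```

If a simple probe like this reaches high accuracy at one layer but not at another, that is taken as evidence about where the property is linearly encoded, which is the kind of layer-by-layer comparison the BERT paper makes.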
A lot of applied work doesn't require interpretability and explainability at the moment, but I suspect the interest will continue to increase.