> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP’s article, makes about as much sense to me as the second one, which I (hopefully fittingly in y’all’s view) had ChatGPT write.

But I do want to express my appreciation for being able to “hang out in the back of the room” while you folks figure this stuff out. It is fascinating, I’ve learned a lot (even got a local LLM running on a NUC), and very much fun. Thanks for letting me watch, I’ll keep my mouth shut from now on ha!
The first paragraph is clear linear algebra terminology. The second looked like deeper subfield-specific jargon, and I was about to ask for a citation: the words are definitely real, but the claim sounded hyperspecific and unfamiliar.
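For anyone else in the back of the room, here is roughly what that first paragraph says, as a toy sketch in plain Python. The vocabulary, the matrix `E`, and the context vector `h` are all made up for illustration; a real model would use learned weights and a tensor library. The point is only that one matrix does both jobs: its rows are looked up on the way in, and the same rows are dotted with the last layer's output on the way out.

```python
# Toy weight tying: one matrix E serves as both the input embedding
# table and the output projection. All sizes and values are invented.
vocab = ["the", "cat", "sat"]
E = [
    [1.0, 0.0],  # embedding for "the"
    [0.0, 1.0],  # embedding for "cat"
    [1.0, 1.0],  # embedding for "sat"
]  # shape: (vocab_size, d_model)


def embed(token_id):
    """Input side: look up row `token_id` of E."""
    return E[token_id]


def logits(h):
    """Output side: project context vector h into vocab space using the same E.

    logits[i] = dot(E[i], h), i.e. logits = E @ h.
    """
    return [sum(e * x for e, x in zip(row, h)) for row in E]


# Hypothetical stand-in for the context vector from the last layer.
h = [2.0, -1.0]
scores = logits(h)  # one score per vocabulary entry
```

Each entry of `scores` is just the dot product of the context vector with one token's embedding, so tokens whose embeddings point the same way as `h` get the highest logits.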
I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpherys, to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.