by saeranv
787 days ago
I think they are accounting for the entire context. They specifically write out:

> P(next_word|previous_words)

So "next_word" is conditioned on "previous_words" (plural), which I took to mean conditioning on all of the previous words jointly.

But I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding. Maybe you could describe this as a discriminative model of conditional probability, but at some point we start describing that kind of information compression as semantic understanding, right?
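To put a rough number on the "lookup table vs. compression" point, here is a back-of-the-envelope sketch. The vocabulary size, context length, and parameter count are illustrative assumptions of mine, not figures from the article or from any specific model, and the helper name is hypothetical. It only counts how many entries an explicit table of P(next_word|previous_words) would need and compares that to a parameter budget.

```python
import math

def lookup_table_log10_entries(vocab_size: int, context_len: int) -> float:
    """Return log10 of the entry count of an explicit conditional table.

    The table needs one row per possible context (vocab_size ** context_len)
    and one column per candidate next word (vocab_size), so the total is
    vocab_size ** (context_len + 1). We work in log10 to avoid huge integers.
    """
    return (context_len + 1) * math.log10(vocab_size)

# Illustrative assumptions only (not taken from any particular model).
VOCAB_SIZE = 50_000      # assumed number of distinct tokens
CONTEXT_LEN = 1_000      # assumed number of previous tokens conditioned on
PARAM_COUNT = 10**11     # assumed number of learned parameters

table_log10 = lookup_table_log10_entries(VOCAB_SIZE, CONTEXT_LEN)
param_log10 = math.log10(PARAM_COUNT)

print(f"explicit table:   about 10^{table_log10:.0f} entries")
print(f"parametric model: about 10^{param_log10:.0f} parameters")
print(f"gap:              about {table_log10 - param_log10:.0f} orders of magnitude")
```

Under those assumptions the explicit table is thousands of orders of magnitude larger than the parameter count, which is the sense in which the model has to be compressing rather than memorizing a table.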