The biggest architectural distinction between BERT and GPT is that BERT looks both ways from a given token, which gives that token context from both sides. This is what made BERT great at the time: the surrounding text, before and after, can change the meaning of the token we are at. You can essentially fill in the middle, or rather correct what's in the middle after it's been said. I believe this is why Apple is using it for iOS 17's auto-correct.
GPT predicts the next word by only looking back at what we have seen so far. In other words, it's autoregressive.
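To make the "looks both ways" vs. "only looks back" difference concrete, here is a minimal sketch of the two attention patterns as 0/1 masks. The function name `attention_mask` is made up for illustration, and real models apply this inside the attention layers rather than as plain lists.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len 0/1 mask.

    mask[i][j] == 1 means position i is allowed to attend to position j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# BERT-style: every position can see every other position.
bert_style = attention_mask(4, causal=False)

# GPT-style: position i can only see positions 0..i (lower triangle).
gpt_style = attention_mask(4, causal=True)

for row in gpt_style:
    print(row)
```

In the GPT-style mask the first token sees only itself, while in the BERT-style mask it already sees the whole sequence.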
Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.
this comment is downvoted but it is correct. while fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. one example is code-llama https://ai.meta.com/blog/code-llama-large-language-model-cod... it is a variant of llama 2 (GPT-style decoder-only) but it supports infilling
> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span
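A rough sketch of the transformation that quote describes, for anyone who wants to see it as code. The sentinel strings, the function names, and the exact SPM layout below are my own placeholders (the paper and Bavarian et al. define the real ones), and this works on raw strings rather than token ids.

```python
import random

# Placeholder sentinel strings; the real special tokens are defined by the tokenizer.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def split_document(doc, rng):
    """Split at the character level into (prefix, middle, suffix).

    The two cut points are sampled independently and uniformly over the
    document length, then ordered.
    """
    a, b = sorted(rng.randint(0, len(doc)) for _ in range(2))
    return doc[:a], doc[a:b], doc[b:]

def to_fim(doc, rng, fim_rate=0.9):
    """Rewrite a document as an infilling example with probability fim_rate."""
    if rng.random() >= fim_rate:
        return doc  # left as an ordinary left-to-right example
    prefix, middle, suffix = split_document(doc, rng)
    if rng.random() < 0.5:
        # PSM: prefix, suffix, then the middle the model has to produce.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"
    # SPM (simplified layout, an assumption): suffix first, then prefix, then middle.
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}{EOT}"

rng = random.Random(0)
print(to_fim("for i, x in enumerate(xs):\n    print(i, x)\n", rng))
```

Either way the model still just predicts the next token in one stream; the middle simply comes last, so both the prefix and the suffix are already "in the past" when it is generated.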
yes but code llama also found that the PSM format was inferior to the SPM format presumably because those hard cuts lose context. the "real" fill-in-the-middle of BERT is i think more likely to model language compared to the "faux" F-i-t-m of flinging prefixes and suffixes around
where is that reported? in Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to the tokenizer edge cases
> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.
that doesn't really feel like a failure of language modeling to me
> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023),
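For context on "token healing": as I understand it, the idea is to back up over the last token of the prompt and only allow next tokens whose text starts with what was removed, so a prompt that ends mid-word can still land on a natural token. Here is a toy sketch with a made-up vocabulary and a greedy longest-match tokenizer (real tokenizers are BPE-based, so this is only an illustration, and `healing_candidates` is a hypothetical helper).

```python
# Toy vocabulary, invented for this example.
VOCAB = ["e", "n", "u", "m", "en", "enu", "enum", "enumerate", "merate", "for", " "]

def greedy_tokenize(text, vocab):
    """Tokenize by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def healing_candidates(prompt, vocab):
    """Drop the last prompt token and list the tokens allowed to replace it.

    Allowed tokens are those whose text starts with the dropped token's text.
    """
    tokens = greedy_tokenize(prompt, vocab)
    kept, dropped = tokens[:-1], tokens[-1]
    allowed = sorted(t for t in vocab if t.startswith(dropped))
    return kept, allowed

kept, allowed = healing_candidates("for enu", VOCAB)
print(kept)     # ['for', ' ']
print(allowed)  # ['enu', 'enum', 'enumerate']
```

Without that step the model is forced to continue after the token 'enu', which is the situation the 'emrate' example in the paper is describing.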
Not defining "BERT" is a little weird, especially since something like MLM is explained. A good rule for writing, which I learned from The Economist, is to explain what something is the first time you mention it. It does lead to some funny explanations sometimes. One of my favorites, again from The Economist, is "HSBC, a bank". It sort of let them know that even if they see themselves as being big and important, they are "just a bank".
No. There are at least two kinds of costs. First, it takes time to search 'adjacent' domains. Second, by reducing your available acronyms/initialisms, you make it harder to map your architecture name onto those letters.
It is fun to think of some of the alternative BERT names that "could have been", such as BIDET = BIDirectional Encoder representations from Transformers.