by heliophobicdude 1023 days ago
The biggest distinction in architecture between BERT and GPT is that BERT looks both ways from a given token. This helps give context to a token. This is what made BERT great at the time because the surrounding text, before and after, could change the meaning of the token we are at. You could essentially fill in the middle, or rather correct what's in the middle after it's been said. I believe this is why Apple is using it for iOS 17's auto-correct.

GPT predicts the next word by only looking back at what we have seen so far. In other words, it's autoregressive.
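
To make the "looks both ways" vs. "only looks back" distinction concrete, here is a minimal sketch of the two attention masks in plain Python. The function name `attention_mask` is made up for illustration; real models build these masks as tensors inside the attention layers.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

# BERT-style: every token sees every other token, before and after.
bert_style = attention_mask(4, causal=False)

# GPT-style: token i sees only positions 0..i (itself and what came before).
gpt_style = attention_mask(4, causal=True)

for row in gpt_style:
    print(["x" if allowed else "." for allowed in row])
```

In the GPT-style mask the second token can attend to positions 0 and 1 only, while in the BERT-style mask it can attend to all four.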


Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.
this comment is downvoted but it is correct. while fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. one example is code-llama https://ai.meta.com/blog/code-llama-large-language-model-cod... it is a variant of llama 2 (GPT-style decoder-only) but it supports infilling

> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span

Tokens are super powerful :)
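
A rough sketch of the transformation described in that quote, under some loud assumptions: the sentinel strings `<PRE>`, `<SUF>`, `<MID>`, `<EOT>` are placeholders, the function names are hypothetical, and the SPM layout here is a simplified "suffix first" ordering rather than the exact compatible variant from Bavarian et al. It also works on strings, whereas the real pipeline inserts special token IDs.

```python
import random

# Placeholder sentinel strings (assumption); the real tokenizer uses dedicated special tokens.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def split_document(doc, rng):
    """Sample two cut points independently and uniformly over the document length."""
    a = rng.randint(0, len(doc))
    b = rng.randint(0, len(doc))
    lo, hi = min(a, b), max(a, b)
    return doc[:lo], doc[lo:hi], doc[hi:]

def to_infilling_example(doc, rng, fim_rate=0.9):
    """With probability fim_rate, rewrite doc so the middle comes last in the stream."""
    if rng.random() >= fim_rate:
        return doc  # left as an ordinary left-to-right training document
    prefix, middle, suffix = split_document(doc, rng)
    if rng.random() < 0.5:
        # Prefix-suffix-middle (PSM)
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"
    # Suffix-prefix-middle (SPM), simplified layout
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}{EOT}"

rng = random.Random(0)
print(to_infilling_example("def add(a, b):\n    return a + b\n", rng))
```

The model is still a plain next-token predictor. It just learns that whatever follows `<MID>` has to fit between the prefix and the suffix it has already read.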

yes but code llama also found that the PSM format was inferior to the SPM format presumably because those hard cuts lose context. the "real" fill-in-the-middle of BERT is i think more likely to model language compared to the "faux" F-i-t-m of flinging prefixes and suffixes around
where is that reported? in Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to the tokenizer edge cases

> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.

that doesn't really feel like a failure of language modeling to me

i flipped the results, my bad.

> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023),
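
A toy illustration of why a character-level cut can need token healing. The vocabulary and the greedy longest-match tokenizer below are invented for this example (real BPE tokenizers merge differently), but the boundary effect is the same in spirit: tokenizing a truncated string can produce tokens that never appear when the full string is tokenized.

```python
# Toy vocabulary, made up for illustration only.
VOCAB = {"enum", "erate", "en", "e", "n", "u", "m", "r", "a", "t"}

def greedy_tokenize(text):
    """Longest-match-first tokenization over VOCAB (a stand-in for a real tokenizer)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens

whole = greedy_tokenize("enumerate")  # ['enum', 'erate']
cut = greedy_tokenize("enu")          # ['en', 'u']

# The tokens of the truncated string are not a prefix of the full string's tokens,
# so the model is asked to continue from a token sequence it rarely saw in training.
print(whole, cut, whole[:len(cut)] == cut)
```

Token healing, roughly, backs up over the last token or two of the prompt and lets the model regenerate them, so the continuation starts from a boundary the tokenizer would actually have produced.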

yeah, I hear you that the decoder-only infilling approach is 'weird' -- I just don't know if I agree that it's manifestly worse at language understanding / performance than the BERT approach
>I believe this is why Apple is using it for iOS 17's auto-correct.

Talk about a negative endorsement. I am continually disappointed in the auto correction implementation.

To be fair, iOS 17 is not out yet. If you are currently using it, you are beta testing it.
Aren't both of them based on embeddings?