| HN Mirror


	by heliophobicdude 1023 days ago
	The biggest distinction in architecture between BERT and GPT is that BERT looks both ways from a given token. This helps give context to a token. This is what made BERT great at the time because the surrounding text, before and after, could change the meaning of the token we are at. You could essentially fill in the middle, or rather correct what's in the middle after it's been said. I believe this is why Apple is using it for iOS 17's auto-correct. GPT predicts the next word by only looking back at what we have seen so far. In other words, it's autoregressive.
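
To make the "looks both ways" versus "only looks back" distinction concrete, here is a minimal sketch of the two attention masks in plain Python. The function names and the example sentence are made up for illustration; real models apply the same idea as a tensor mask inside attention rather than as Python lists.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible(tokens, mask, i):
    """Tokens that position i is allowed to see under the given mask."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]


# Hypothetical example sentence, already split into word-level "tokens".
tokens = ["the", "river", "bank", "was", "muddy"]
i = tokens.index("bank")

# Bidirectional: "bank" sees the words on both sides.
print(visible(tokens, bidirectional_mask(len(tokens)), i))
# ['the', 'river', 'bank', 'was', 'muddy']

# Causal: "bank" sees only itself and what came before it.
print(visible(tokens, causal_mask(len(tokens)), i))
# ['the', 'river', 'bank']
```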

3 comments

astrange 1023 days ago

Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.

huac 1022 days ago

this comment is downvoted but it is correct. while fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. one example is code-llama https://ai.meta.com/blog/code-llama-large-language-model-cod... it is a variant of llama 2 (GPT-style decoder-only) but it supports infilling

> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span

Tokens are super powerful :)
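
A rough sketch of the transformation the quote describes, assuming plain strings stand in for tokenized text. The sentinel strings, function names, and the exact ordering of sentinels are assumptions for illustration (the SPM layout follows my reading of the "compatible" variant in Bavarian et al.); this is not Code Llama's actual preprocessing code.

```python
import random

# Placeholder sentinel strings; a real tokenizer would use dedicated special token ids.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"


def split_document(doc, rng):
    """Sample two cut points independently and uniformly over the document length."""
    a = rng.randint(0, len(doc))
    b = rng.randint(0, len(doc))
    lo, hi = min(a, b), max(a, b)
    return doc[:lo], doc[lo:hi], doc[hi:]


def to_psm(prefix, middle, suffix):
    """Prefix-suffix-middle: the middle is generated right after a sentinel."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"


def to_spm(prefix, middle, suffix):
    """Suffix-prefix-middle (assumed layout): the middle directly continues the prefix."""
    return f"{PRE}{SUF}{suffix}{MID}{prefix}{middle}{EOT}"


def fim_transform(doc, rng, fim_rate=0.9, spm_rate=0.5):
    """Apply infilling formatting with probability fim_rate, else keep the document."""
    if rng.random() >= fim_rate:
        return doc
    prefix, middle, suffix = split_document(doc, rng)
    fmt = to_spm if rng.random() < spm_rate else to_psm
    return fmt(prefix, middle, suffix)


rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

Either way the model still only reads left to right; the "after" context is visible simply because it was moved earlier in the stream. Note that in the SPM layout there is no sentinel between the prefix and the middle, so a character-level cut can land in the middle of what would normally be a single token.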

swyx 1022 days ago

yes but code llama also found that the PSM format was inferior to the SPM format presumably because those hard cuts lose context. the "real" fill-in-the-middle of BERT is i think more likely to model language compared to the "faux" F-i-t-m of flinging prefixes and suffixes around

huac 1022 days ago

where is that reported? in Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to the tokenizer edge cases

> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.

that doesn't really feel like a failure of language modeling to me

swyx 1022 days ago

i flipped the results, my bad.

> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023),
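
For anyone unfamiliar with token healing, here is a toy sketch of the general idea: back up over the last prompt token and only allow continuations that start with the text you removed. The vocabulary, the greedy tokenizer, and the function names are invented for illustration and are not taken from the paper or from any library.

```python
# Toy vocabulary (an assumption); "enumerate" is normally split as "enum" + "erate".
VOCAB = ["e", "n", "u", "m", "r", "a", "t", "en", "enu", "enum", "erate", "merate"]


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def heal(prompt, vocab=VOCAB):
    """Drop the last prompt token and list the tokens allowed to replace it."""
    tokens = tokenize(prompt, vocab)
    dropped = tokens[-1]
    allowed = [t for t in vocab if t.startswith(dropped)]
    return tokens[:-1], allowed


print(tokenize("enumerate"))  # ['enum', 'erate']  (the split seen in ordinary text)
print(tokenize("enu"))        # ['enu']            (a boundary that is rare in ordinary text)
print(heal("enu"))            # ([], ['enu', 'enum'])
```

Without healing, the model has to continue from the unusual token "enu" and produce "merate"; with healing, it can re-pick "enum" and continue with "erate", which is the split it saw during training.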

huac 1020 days ago

yeah, I hear you that the decoder-only infilling approach is 'weird' -- I just don't know if I agree that it's manifestly worse at language understanding / performance than the BERT approach

fbdab103 1022 days ago

>I believe this is why Apple is using it for iOS 17's auto-correct.

Talk about a negative endorsement. I am continually disappointed in the auto correction implementation.

pilotneko 1022 days ago

To be fair, iOS 17 is not out yet. If you are currently using it, you are beta testing it.

3abiton 1022 days ago

Aren't both of them based on embeddings?
