by ceph_ 1022 days ago
Since the article never defines what a BERT is:

(Bidirectional Encoder Representations from Transformers)

https://en.wikipedia.org/wiki/BERT_(language_model)


The biggest distinction in architecture between BERT and GPT is that BERT looks both ways from a given token. This helps give context to a token. This is what made BERT great at the time because the surrounding text, before and after, could change the meaning of the token we are at. You could essentially fill in the middle, or rather correct what's in the middle after it's been said. I believe this is why Apple is using it for iOS 17's auto-correct.

GPT predicts the next word by only looking back at what we have seen so far. In other words, it's autoregressive.
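
To make the "looks both ways" vs. "only looks back" distinction concrete, here is a toy sketch of the two attention masks. The function names are made up for illustration and this is not how either model is actually implemented; it only shows which positions a given token is allowed to see.

```python
# Toy illustration: which positions can token i attend to?
# All names here are hypothetical helpers, not part of any library.

def causal_mask(n):
    # GPT-style: position i sees positions 0..i (itself and everything before).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # BERT-style: every position sees every other position.
    return [[1] * n for _ in range(n)]

def visible(tokens, mask, i):
    # Return the tokens that position i is allowed to attend to.
    return [t for t, m in zip(tokens, mask[i]) if m]

tokens = ["the", "bank", "of", "the", "river"]
print(visible(tokens, causal_mask(len(tokens)), 1))         # only "the", "bank"
print(visible(tokens, bidirectional_mask(len(tokens)), 1))  # the whole sentence
```

With the causal mask, "bank" cannot see "river", so it has nothing to disambiguate it with; with the bidirectional mask it can.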

Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.
this comment is downvoted but it is correct. while fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. one example is code-llama https://ai.meta.com/blog/code-llama-large-language-model-cod... it is a variant of llama 2 (GPT-style decoder-only) but it supports infilling

> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span

Tokens are super powerful :)
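
For anyone who wants to see what "encode them into a single stream" looks like, here is a rough sketch of the transformation described in the quote. Assumptions: the sentinel strings `<PRE>`, `<SUF>`, `<MID>`, `<EOT>` are placeholders I picked, the real model uses special tokenizer tokens rather than plain strings, and the SPM layout below is a simplified ordering, not necessarily the exact one from Bavarian et al. The 0.9 rate and the 50/50 PSM/SPM choice come from the quoted passage.

```python
import random

# Placeholder sentinels (assumption): stand-ins for the four special tokens.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def split_document(doc, rng):
    # Two character-level cut points, sampled independently and uniformly
    # over the document length, then ordered.
    a = rng.randint(0, len(doc))
    b = rng.randint(0, len(doc))
    lo, hi = min(a, b), max(a, b)
    return doc[:lo], doc[lo:hi], doc[hi:]

def to_psm(prefix, middle, suffix):
    # Prefix-suffix-middle: the model sees both sides, then generates the middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"

def to_spm(prefix, middle, suffix):
    # Suffix-prefix-middle (simplified layout, an assumption).
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}{EOT}"

def fim_transform(doc, rng, fim_rate=0.9):
    # Leave the document untouched with probability 1 - fim_rate.
    if rng.random() >= fim_rate:
        return doc
    prefix, middle, suffix = split_document(doc, rng)
    # Half of the splits go to PSM, the other half to SPM.
    if rng.random() < 0.5:
        return to_psm(prefix, middle, suffix)
    return to_spm(prefix, middle, suffix)

rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

The model is still trained purely left to right on the resulting string; the "after" context is visible only because it was moved in front of the middle.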

yes but code llama also found that the PSM format was inferior to the SPM format presumably because those hard cuts lose context. the "real" fill-in-the-middle of BERT is i think more likely to model language compared to the "faux" F-i-t-m of flinging prefixes and suffixes around
where is that reported? in Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to the tokenizer edge cases

> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.

that doesn't really feel like a failure of language modeling to me

i flipped the results, my bad.

> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023),

>I believe this is why Apple is using it for iOS 17's auto-correct.

Talk about a negative endorsement. I am continually disappointed in the auto correction implementation.

To be fair, iOS 17 is not out yet. If you are currently using it, you are beta testing it.
Aren't both of them based on embeddings?
Not defining "BERT" is a little weird, especially since something like MLM is explained. A good rule for writing, which I learned from The Economist, is to explain what something is the first time you mention it. It does lead to some funny explanations sometimes. One of my favorites, again from The Economist, is "HSBC, a bank". It sort of let them know that even if they see themselves as being big and important, they are "just a bank".
Thanks, will keep this in mind for future posts!
I find the acronym unfortunate because it overloads a previously unique acronym in an adjacent space. It costs nothing to avoid these issues.

https://en.wikipedia.org/wiki/Bit_error_rate#Bit_error_rate_...

> It costs nothing to avoid these issues.

No. There are at least two kinds of costs. First, it takes time to search 'adjacent' domains. Second, by reducing your available acronyms/initialisms, you make it harder to map your architecture name onto those letters.

It is fun to think of some of the alternative BERT names that "could have been", such as BIDET = BIDirectional Encoder representations from Transformers.

Yes, I visited this thread hoping for BERT software that can find memory errors, either due to radiation upset or signal integrity issues.

Now my disappointment is immeasurable and my day is ruined.

If your computer is a Xilinx MPSoC there is IBERT. It's a pretty awesome concept.