HN Mirror

by jsenn 586 days ago

The "embarrassingly simple inference technique" is to put a bunch of [MASK] tokens at the end of the prompt.
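
For anyone who wants to see it, here is a rough string-level sketch of that trick. It is not the paper's code: `append_masks` is a hypothetical helper, the `[MASK]` spelling and space-joining are assumptions about the tokenizer, and how the model then fills in the masks is left out entirely.

```python
def append_masks(prompt: str, num_masks: int, mask_token: str = "[MASK]") -> str:
    """Return the prompt followed by `num_masks` mask placeholders.

    Hypothetical helper: a real setup would work on token ids and use the
    tokenizer's own mask token rather than a literal string.
    """
    if num_masks < 0:
        raise ValueError("num_masks must be non-negative")
    # Join with spaces so each placeholder stays a separate token-like unit.
    return " ".join([prompt] + [mask_token] * num_masks)


# Example: leave room for three tokens of "generation" after the prompt.
padded = append_masks("Translate to French: cheese =>", 3)
print(padded)
```

The masked LM would then be asked to predict the placeholder positions, which is what lets a BERT-style model behave like a generator here.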

I'm having trouble understanding whether this paper is saying anything new. The original BERT paper already compared it favourably to causal models including GPT. Was there any doubt that BERT-style models could be in-context learners?

From what I gather as a non-expert, the problem with BERT is scaling/training efficiency: GPT gets C-1 prediction targets out of a training input of length C, but BERT only gets about 0.15*C, since it masks roughly 15% of the tokens. Indeed, the author points out that DeBERTa required 3x more compute than GPT-3 to achieve the level of performance reported, which makes sense.
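
A back-of-the-envelope version of that counting argument, as a sketch. The 15% mask rate is the usual BERT default and the context length of 512 is just an example value; the function names are made up. This only counts prediction targets per sequence, so it says nothing directly about total compute.

```python
def causal_targets(context_len: int) -> int:
    """Next-token targets a causal LM gets from one sequence (C - 1)."""
    return max(context_len - 1, 0)


def masked_targets(context_len: int, mask_rate: float = 0.15) -> float:
    """Expected masked-token targets a BERT-style model gets (mask_rate * C)."""
    return mask_rate * context_len


C = 512  # example context length, chosen only for illustration
ratio = causal_targets(C) / masked_targets(C)
print(f"causal: {causal_targets(C)}, masked: {masked_targets(C):.1f}, ratio: {ratio:.2f}")
```

With these assumptions the causal model sees a bit under 7x as many targets per sequence, and the ratio approaches 1/0.15 as C grows.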