

by canjobear 673 days ago

It’s true that the Transformer architecture was developed for seq2seq machine translation (MT), but you can get similar performance with Mamba, RWKV, or other new non-Transformer architectures. What seems to matter is having a strong general sequence-learning architecture plus tons of data.

> The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot.

The Markov property only matters when the relevant text falls outside the context window.
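
To make that concrete, here is a toy sketch using a count-based n-gram model as a stand-in for "predictor with a fixed context window". The corpus, the window sizes, and the helper names (`train`, `next_token_dist`) are all made up for illustration; nothing here is how a real LM is implemented. The point is only that two histories look identical to the model exactly when the token that distinguishes them has fallen out of the window.

```python
from collections import Counter, defaultdict

def train(tokens, k):
    """Count next-token frequencies for every length-k context."""
    counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts

def next_token_dist(counts, history, k):
    """Predict from the last k tokens only; anything earlier is invisible."""
    ctx = tuple(history[-k:])
    c = counts.get(ctx, Counter())
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

# Toy corpus (an assumption for this sketch, not real training data).
corpus = "the cat sat on the mat and the dog sat on the rug".split()
hist_cat = "the cat sat on the".split()
hist_dog = "the dog sat on the".split()

# Window of 3: "cat"/"dog" is outside the window, so both histories
# collapse to the same context ("sat", "on", "the").
short = train(corpus, 3)
short_cat = next_token_dist(short, hist_cat, 3)
short_dog = next_token_dist(short, hist_dog, 3)

# Window of 5: the distinguishing token is visible again.
long_ = train(corpus, 5)
long_cat = next_token_dist(long_, hist_cat, 5)
long_dog = next_token_dist(long_, hist_dog, 5)

print(short_cat, short_dog)
print(long_cat, long_dog)
```

With the short window the two predictions are indistinguishable; with the long one they are not. If everything relevant fits in the window, the Markov limitation never shows up.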

> Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.

I’m not sure what you’re getting at here. If it’s that you can predict the next token in many cases without looking at the source language, then that’s also true for traditional encoder-decoder architectures, so it’s not a problem unique to prompting. Or are you getting at problems arising from teacher-forcing?

Basically the question was how an LM could possibly help translation, and the answer is that it gives you a strong prior for the decoder. That’s also the basic idea in the theoretical UMT (unsupervised MT) paper: you are trying to find a function from source to target language that produces a sensible distribution as defined by an LM.
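
A rough sketch of what "LM as a prior for the decoder" can mean in practice, assuming a noisy-channel style score of the form log p(source | target) + weight * log p_LM(target). The candidate translations and their log-probabilities below are invented purely for illustration, and `rescore` is a hypothetical helper, not any library's API.

```python
def rescore(candidates, lm_weight=1.0):
    """Return the candidate maximizing channel_logp + lm_weight * lm_logp."""
    return max(
        candidates,
        key=lambda c: c["channel_logp"] + lm_weight * c["lm_logp"],
    )

# Invented numbers: the literal rendering explains the source slightly
# better, but the LM finds it much less plausible as target-language text.
candidates = [
    {"text": "I have hunger", "channel_logp": -1.0, "lm_logp": -9.0},
    {"text": "I am hungry", "channel_logp": -1.5, "lm_logp": -3.0},
]

without_prior = rescore(candidates, lm_weight=0.0)
with_prior = rescore(candidates, lm_weight=1.0)

print(without_prior["text"])
print(with_prior["text"])
```

With the LM weight at zero the literal candidate wins on the channel score alone; once the LM term is included, the fluent one wins. That is the sense in which the LM acts as a prior over target-language output.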