	by petilon 1 day ago
	Noam Shazeer was one of the lead authors of the seminal paper "Attention Is All You Need", which introduced the transformer architecture. (From Wikipedia)

tmule 1 day ago

This understates his criticality. The author list was randomized, but the critical idea was truly his. Wonder what this says about GDM …

HarHarVeryFunny 1 day ago

The architecture was Shazeer's, but the rough idea came from Jakob Uszkoreit who initiated the project.

Uszkoreit wanted to build a more efficient/scalable language/seq2seq model that could take advantage of GPU parallelism (replacing RNNs which were the main approach to sequence modelling at that time).

Uszkoreit's insight was that although language appears sequential, it is in fact partly parallel and partly hierarchical, as can be seen in linguists' sentence parse trees, where at each level there is parallelism/independence between the branches of the tree, with them getting combined at the next level up. This is what gave rise to the idea of a model that consisted of a stack of parallel processing layers (transformer layers). I believe that attention was also part of the plan from day one, as this had already been proven to be valuable (Bahdanau) with RNN seq2seq modelling.

So, this is what Uszkoreit wanted to build, but by his own account he failed to come up with an implementation that matched or outperformed the prevailing RNN approach that he wanted to replace. At this point, Uszkoreit mentioned the idea to Shazeer, who got on board and eventually arrived at a performant architecture, which was then pared back by an ablation process, resulting in the initial encoder-decoder Transformer architecture. Shazeer is also known for his work on sparsely-gated mixture-of-experts layers, and for other optimizations after he left to found character.ai.

abixb 22 hours ago

Curious about others' contributions to the paper, such as those of Vaswani, Parmar, Jones and Gomez. What sucks about co-authorship in research papers is that you don't get a clean breakdown of who contributed what, and the distribution (more often than not) looks very much like a Pareto distribution.

I'm talking from plenty of group project experience here.

cl3misch 36 minutes ago

> What sucks about co-authorship in research papers is that you don't get a clean breakdown of who contributed what to the research paper

Why? If you read a research paper for its content this is not especially important.

This thread is more about the people of course, and here we care, but that's not the point of a research paper.

hatsix 5 hours ago

This is fascinating. Do you know if there's something I can read that has this mix of timeline and technical detail?

senordevnyc 22 hours ago

Can you expound on the ablation process? Is that referring to a stripping down of the data or weights or something? Or a stripping down of the transformer architecture structurally? Just curious

tedd4u 22 hours ago

You train the model, then do a baseline evaluation. Then you evaluate many variants where you have removed or nulled out different layers or chunks of the model. By comparing the performance of those mutated models to the baseline you can learn a lot about the model: for example, which parts don't have much value and can be removed, or where particular "functions" or "facts" are located. Google it.
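
To make the procedure above concrete, here is a minimal sketch of an ablation loop. Everything in it is hypothetical and chosen for illustration: the "model" is just a sum of named components (`linear`, `bias`, `unused`), and the metric is mean squared error on a tiny made-up dataset. It is not how the Transformer authors ran their ablations, only the general pattern of "evaluate a baseline, remove one part at a time, compare".

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Component = Callable[[float], float]
Dataset = List[Tuple[float, float]]


def make_model(components: Dict[str, Component],
               disabled: FrozenSet[str] = frozenset()) -> Component:
    """Build a toy model that sums the outputs of all enabled components."""
    def predict(x: float) -> float:
        return sum(fn(x) for name, fn in components.items()
                   if name not in disabled)
    return predict


def mean_squared_error(predict: Component, data: Dataset) -> float:
    """Average squared difference between predictions and targets."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def ablation_study(components: Dict[str, Component],
                   data: Dataset) -> Tuple[float, Dict[str, float]]:
    """Return the baseline error and, per component, the error increase
    observed when that single component is removed."""
    baseline = mean_squared_error(make_model(components), data)
    deltas = {}
    for name in components:
        ablated = make_model(components, frozenset([name]))
        deltas[name] = mean_squared_error(ablated, data) - baseline
    return baseline, deltas


# Hypothetical components of a toy model for the target y = 2x + 1.
components: Dict[str, Component] = {
    "linear": lambda x: 2.0 * x,
    "bias": lambda x: 1.0,
    "unused": lambda x: 0.0,  # contributes nothing, so removing it is free
}
data: Dataset = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

baseline, deltas = ablation_study(components, data)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name}: error increase {delta}")
```

A component whose removal barely changes the error (here `unused`) is a candidate for being pared away, while a large increase (here `linear`) marks a part the model depends on.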

tintor 17 hours ago

How come they didn’t ablate the encoder? OpenAI GPT models are decoder only.

HarHarVeryFunny 8 hours ago

It was originally built as a general purpose sequence-to-sequence (seq2seq) model.

The research history leading up to this was interesting - there had been a bunch of work, in various domains, on "autoencoder" architectures used to learn compact representations for things like dimensionality reduction and sequence representation. The idea was to have an encoder-decoder pair, connected by a limited bottleneck representation, with the training goal of the decoder reconstructing the encoder input from the bottleneck representation.

One example of this was to learn a fixed size(!) sequence (e.g. sentence) representation using an LSTM-based autoencoder (LSTM->embedding->LSTM), which at the time seemed rather shocking - the ability to represent a variable length sequence with a fixed size embedding. Equally shocking was that you could use this for machine translation simply by connecting an LSTM encoder for one language to an LSTM decoder for another language.

This type of LSTM->LSTM seq2seq encode-decode architecture for machine translation was then improved by Bahdanau by replacing the fixed size representation with an attention mechanism so the decoder could learn to be more specific about input-output relationships.

This type of LSTM-based seq2seq encode-decode architecture, using attention, is what Uszkoreit et al. set out to improve: to make it more efficient by using a parallel rather than sequential (RNN) architecture. The Transformer was never conceived of as purely for language modelling, or as an "AI" architecture. Later, when the usage focused on language modelling (generation, not translation), the encoder was dropped since input and output are the same thing.

mike_hearn 15 hours ago

If you read the Wired article linked elsewhere on this thread, then it explains that. The work was being done by people from the Google Translate team.

flebron 1 day ago

Source for this? The notion of attention dates to a content-addressable lookup during sequence alignment (as well as, concurrently, memory lookups in neural Turing machines). Attention had been used in other models, like GRUs and LSTMs with attention. The Vaswani et al. paper did not introduce attention, it just removed everything _but_ attention (and FFW) from the network. Are you claiming the "critical idea" of removing the GRU and LSTM parts and just keeping attention was "truly" Noam's?

daemonologist 1 day ago

At some point in late 2017 the paper was updated with this additional detail:

    Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.

In any case, if the authors considered their contributions equal, that's good enough for me.
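
For readers unfamiliar with the "scaled dot-product attention" named in that note, here is a minimal sketch of the formula from the paper, softmax(QK^T / sqrt(d_k)) V, in plain Python with no dependencies. The function names and the list-of-lists matrix representation are choices made for this illustration, not the authors' code, and multi-head attention (several such attentions run in parallel on learned projections) is left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Two identical keys get equal weight, so the output is the mean of the values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0], [4.0]]
print(scaled_dot_product_attention(Q, K, V))  # prints [[3.0]]
```

This is the "content-addressable lookup" view of attention: each query is compared against all keys, and the result is a soft, weighted read of the values rather than a single exact match.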

tmule 22 hours ago

Thanks - wanted to point to this, and indeed should have worded my claim more precisely. And yes, am aware of prior work on attention. (I need to look it up, but I recall Noam saying publicly that he wouldn’t have agreed to random ordering of contributions if he had known this was going to be this big.)

mi_lk 1 day ago

I didn't know we could just say things now. Ah, we're on the internet.

dyauspitr 20 hours ago

That’s not true. Jakob, Ashish and Illia were responsible for the core idea and initial implementation, and Noam for several critical implementation details.

d4rkp4ttern 1 day ago

Is this a generally well known thing?

tmule 22 hours ago

Nope, but it’s not particularly unknown either. It shouldn’t be a surprise; he had remarkable research contributions before and after (separately, he was also an IMO gold medalist).

markdown 1 day ago

Even more important, I wonder what it says about HBW...

khazhoux 1 day ago

Even if we knew, we’d still fail to understand GHO

fastball 1 day ago

But more importantly the impact this has on TLAs
