| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by apsec112 889 days ago
	Interesting that the transformer used is tiny. From the paper: "We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."

2 comments

albertzeyer 889 days ago

It's not tiny, this is a quite normal size outside the field of LLMs, e.g. normal-sized language models, or also translation models, or acoustic models. Some people even would call this large.

link

apsec112 889 days ago

It's tiny by the standards of transformers, pretty sure most transformers trained (across all domains) are larger than this

link

albertzeyer 889 days ago

No. Where do you have this from?

Looking at NeurIPS 2023:

https://openreview.net/group?id=NeurIPS.cc/2023/Conference#t...

Some random spotlight papers:

- https://openreview.net/pdf?id=YkBDJWerKg: Transformer (VPT) with 248M parameters

- https://openreview.net/pdf?id=CAF4CnUblx: Vit-B/16 with 86M parameters

- https://openreview.net/pdf?id=3PjCt4kmRx: Transformer with 282M parameters

Also, in my field (speech recognition, machine translation, language modeling), all using Transformer variants, this is a pretty normal model size.

link

kristjansson 889 days ago

It's tiny by the standards of LLMs; _L_LM might give you some indication as to where they fit in the overall landscape.

link

ipnon 889 days ago

This suggests there may be some more low hanging fruit in the hard sciences for transformers to bite at, so long as they can be properly formulated. This was not a problem of scaling it seems.

link