Hacker News
by cs702 2184 days ago
"Quién es más macho?"

In a very short time, transformers have gone from under 1B, to 1.5B, to 3B, to 5B, to 175B, and now 600B parameters. 1T is only, what, like 67% more parameters, and therefore likely to be achieved in the short term. In fact, the authors of this paper tried 1T but ran into numerical issues that they will surely address soon. Not long after someone crosses 1T, expect 10T to become the next target. And why not? The best-funded AI research groups are in a friendly competition to build the biggest, baddest, meanest m-f-ing models the world has ever seen.

Scores continue to increase with diminishing returns, which is all fine and nice, but more importantly it seems we should expect to see machine-generated text getting much better from a qualitative standpoint -- that is, becoming less and less distinguishable from a lot of human output. That has been the trend so far.

We live in interesting times.

3 comments

This is a sparse model. You can't just compare the parameter count against dense models like GPT-3.

Otherwise, Google already had a 137B parameter model in 2017: https://arxiv.org/abs/1701.06538

The model is sparsely-gated, not sparse. The individual experts in each mixture of experts are dense layers but they're sparsely activated, i.e., on each forward pass only some of them are conditionally used.
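To make the "dense experts, sparsely activated" distinction concrete, here is a minimal pure-Python sketch of a sparsely-gated layer. It is only an illustration under stated assumptions, not the code from either paper: the function names, the sizes, and the simplified top-k gate (no noise term, no load balancing) are all hypothetical.

```python
import math
import random

def make_dense(d_in, d_out, rng):
    # A dense weight matrix with d_out rows and d_in columns (no zeroed weights).
    return [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def matvec(w, x):
    # Plain dense matrix-vector product.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def moe_forward(x, gate_w, experts, k=2):
    # The gate produces one logit per expert; only the top-k experts are run.
    logits = matvec(gate_w, x)
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(experts[0])
    for w, i in zip(weights, top):
        y = matvec(experts[i], x)  # each selected expert is a dense layer
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
d_model, n_experts = 4, 8
gate_w = make_dense(d_model, n_experts, rng)
experts = [make_dense(d_model, d_model, rng) for _ in range(n_experts)]

out, used = moe_forward([1.0, 0.5, -0.5, 2.0], gate_w, experts, k=2)
print("experts used on this forward pass:", used)
```

All 8 experts hold dense weights, but only 2 of them do any work for this input.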

As to comparing parameter counts, I disagree with you. I think it's perfectly OK to compare parameter counts for different kinds of models. It would also be perfectly OK to compare, say, computational efficiency per parameter in each forward pass (which for this model is impressive), but that wasn't the focus of my comment above.

Finally, you're right that I didn't mention all the interim parameter counts that we have seen below 600B in all transformer variants. The list would have been way too long had I tried to include every figure!

Probably the most relevant comparison here would be a mix of wallclock-hours and FLOPs. The MoE may be inefficient on a parameter level, but it may be the most efficient way to convert FLOPs into model power (sort of like how you currently do better making models wider than deeper - experts are the ultimate 'width').

It depends on your goal. If you want to measure the number of "artificial synapses" or connections, total parameters is the right figure to use, because each weight is one such connection. If you want to measure the computational cost of training or inference, then wallclock-hours and FLOPs would be better figures.
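A rough back-of-the-envelope sketch of the two ways of counting, for a single mixture-of-experts layer. The layer shape and all sizes below are hypothetical, picked only to show how far apart the two figures can be. They are not taken from the paper.

```python
def moe_layer_params(d_model, d_ff, n_experts, k):
    # Assumption: each expert is a two-matrix feed-forward block, biases ignored.
    per_expert = 2 * d_model * d_ff
    # Assumption: the gate is a single d_model x n_experts matrix.
    gate = d_model * n_experts
    total = n_experts * per_expert + gate   # every connection that exists
    active = k * per_expert + gate          # connections used per token
    return total, active

# Hypothetical sizes for illustration only.
total, active = moe_layer_params(d_model=1024, d_ff=4096, n_experts=512, k=2)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"ratio:             {total / active:.0f}x")
```

The first number answers "how many connections are there", the second is closer to "how much compute does one forward pass cost".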

The 100's of trillions of connections (synapses) in the human brain are sparsely used -- i.e., your entire brain doesn't light up in response to every single stimulus. But we still talk about 100's of trillions of synapses when we refer to the size of the human brain's connectome. It's a perfectly valid way of measuring model size.

More to your point, the authors measure the computational cost of training in Table 3 of the paper in TPU-core-years for the various mixture-of-expert models, and compare them to an always-densely-used variant.

Fair enough, sparse usually means weights are sparse and not activations.

Obviously you can compare parameter count if you really want to, but from a technical point of view training a densely activated model is a much bigger feat. Also, I have personally spoken to one of the authors of this paper and they said sparsely activated models tend to do better on tasks that require knowledge but not on tasks that require intelligence (e.g. GLUE).

I agree, training a dense model with the same number of parameters would be a much bigger feat.

Otherwise, as I mentioned elsewhere on this page, we routinely describe the size of the human brain in terms of numbers of synapses (connections), even though they are sparsely activated. Only a small subset of your brain 'lights up' for a given input. Number of parameters (connections) is a perfectly sensible way to measure model size.

Anyway, I expect we will see both much larger sparsely and densely activated models going forward. We live in interesting times :-)

Well, if you can come up with a more elegant way forward, one that doesn’t require all that hardware and money and should, therefore, by definition be within the reach of the critics of “big” AI, I see no reason why its superior qualitative results wouldn’t be appreciated.

I'm not a critic! But I can see how my comment, meant to be a bit humorous, could be misinterpreted as being critical.

Personally, I think the friendly race to build bigger models is a great development. As I mentioned above, it seems to be leading to models that generate text/sequences that are qualitatively much better.

Are they using this for Google Translate yet? https://www.deepl.com/en/translator is currently better than Google Translate. Although for translating forums on a website etc., I think the Netflix method would be better. I hope Google adopts it for its Translate app: https://arxiv.org/abs/2005.11197

Highly unlikely at the moment. But clearly that is the direction in which translation is going, so companies lacking the economies of scale that come with owning massive computational infrastructure will be at a serious disadvantage.

Google Translate's last major update was in May. They still use more restricted model sizes than the top-end research, but the techniques are making their way into the product.

https://ai.googleblog.com/2020/06/recent-advances-in-google-...

With a cursory analysis, it's not obvious whether DeepL is better than Google Translate any more.