by aszen 2173 days ago
Interesting. I wonder what happens now that Moore's law is considered dead and we can't rely on computational power increasing year over year. To make further progress with general-purpose search and learning methods we will need a lot more computational power, which may not be cheaply available. Do we then focus our efforts on developing more efficient learning strategies, like the one we have in our minds?

I do agree with the part about not embedding human knowledge into our computer models: to make true progress in AI, the computer should be able to learn on its own any knowledge worth learning about any domain.

5 comments

The amount of compute used in the largest AI training runs has been exponentially growing:

https://openai.com/blog/ai-and-compute/

The amount of compute required for Imagenet classification has been exponentially decreasing:

https://openai.com/blog/ai-and-efficiency/

My background is in NLP. I suspect we'll see something similar in language processing models to what we've seen in vision models. Consider this article[1] ("NLP's ImageNet moment has arrived"), which compares AlexNet in 2012 to the first GPT model 6 years later: we're just a few years behind.

True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly data- and compute-hungry. That's the 'tick' your second article mentions.

We simultaneously have people doing research on the 'tock': reducing the compute needed. ICLR 2020 was full of alternative training schemes that required less compute for similar performance (e.g. ELECTRA[2]). Model distillation is another interesting idea that reduces the amount of inference-time compute needed.

[1] https://thegradient.pub/nlp-imagenet/

[2] https://openreview.net/pdf?id=r1xMH1BtvB
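
To make the distillation idea above concrete, here is a minimal sketch in plain Python of the usual setup: a small "student" model is trained to match the temperature-softened output distribution of a large "teacher". The function names, the temperature of 2.0 and the example logits are illustrative assumptions, not taken from any of the linked papers, and the loss is a simplified form (KL divergence only).

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper for illustration: a real training loop would
    average this over a batch and usually add a hard-label loss term.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits: the student roughly agrees with the teacher.
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. The inference-time saving comes from deploying only the smaller student.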

Very interesting links, thanks for sharing.

So the trend isn't changing: we still need bigger models to make progress in NLP and CV. The algorithmic efficiencies are promising, but they aren't giving anywhere near the same improvements as larger models.

I'm curious how long this trend will continue and whether there's anything promising that can reverse it.

IMHO the main thing that determines this trend is whether the results are good enough. For the most part, there's only some overlap between the people who work on better results and the people who work on more efficient results; those research directions are driven by different needs and thus also tend to happen in different institutions.

As long as our proof-of-concept solutions don't yet solve the task appropriately, as long as the solution is weak and/or brittle and worse than what we need for the main practical applications, most of the research focus (and the research progress) will be on models that try to give better results. It makes sense to disregard the compute cost and other practical inconveniences when working on pushing the bleeding edge, trying to make previously impossible things possible.

However, when tasks are "solved" from the academic proof-of-concept perspective, then generally the practical, applied work on model efficiency can get huge reductions in computing power required. But that happens elsewhere.

The concept of technology readiness level (https://en.wikipedia.org/wiki/Technology_readiness_level) is relevant. For the NLP and CV technologies that are in TRL 3 or 4, the efficiency does not really matter as long as it fits in whatever computing clusters you can afford; this is mainly an issue for the widespread adoption of some tech in industry by the time the same tech is in TRL 6 or so, and this work mostly gets done by different people in different organizations with different funding sources than the initial TRL 3 research.

Moore's law might be dead but the deeper law is still alive.

Moore's law is technically "the number of transistors per unit area doubles every 24 months" [1]. The more important law is that the cost of transistors halves every 18-24 months.

That is, Moore's law talks about how many transistors we can pack into a unit area. The deeper issue is how much it costs. If we can only pack in a certain number of transistors per area but the cost drops exponentially, we still see massive gains.

There's also Wright's law [3], which describes exponentially dropping costs that come just from institutional knowledge: each doubling of cumulative production multiplies the unit cost by roughly 0.75-0.9.

[1] https://en.wikipedia.org/wiki/Moore%27s_law

[2] https://www.youtube.com/watch?v=Nb2tebYAaOA

[3] https://en.wikipedia.org/wiki/Experience_curve_effects
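
As a rough sketch of the Wright's law arithmetic mentioned above: if each doubling of cumulative production multiplies unit cost by a fixed ratio, then cost after N units is the first-unit cost times that ratio raised to log2(N). The function name and the example figures below are illustrative assumptions, not data about any real product.

```python
import math

def wright_unit_cost(first_unit_cost, cumulative_units, progress_ratio):
    """Unit cost after `cumulative_units` have been produced.

    `progress_ratio` is the cost multiplier per doubling of cumulative
    production (0.75-0.9 in the range quoted in the comment above).
    """
    doublings = math.log2(cumulative_units)
    return first_unit_cost * progress_ratio ** doublings

# Hypothetical product whose first unit costs 100 (arbitrary currency).
for ratio in (0.75, 0.9):
    costs = [round(wright_unit_cost(100.0, n, ratio), 2) for n in (1, 2, 1024)]
    print(ratio, costs)
```

Ten doublings (1024 units) at a ratio of 0.75 bring the unit cost down to under 6% of the original, while a ratio of 0.9 leaves it at roughly a third, so the result is quite sensitive to where in that range a given technology sits.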

Agreed, the cost aspect of Moore's law may well remain true, especially with chiplets that mix fab nodes and with 3D architectures. Wright's law will also bring down costs as lower-nm nodes mature.

But as mentioned in the comments below, the compute used for AI model training is increasing exponentially (the compute required to train models has been doubling every 3.6 months), so it still far outstrips the cost savings.
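
To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It takes the two rates quoted in this thread at face value (compute demand doubling every 3.6 months, cost per unit of compute halving every 18-24 months) and assumes both are smooth exponentials, which is a simplification. The function names are made up for illustration.

```python
def growth_factor(months, doubling_months):
    """Multiplier after `months` for a quantity that doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)

def relative_training_cost(months, compute_doubling_months=3.6,
                           cost_halving_months=24.0):
    """Cost of the largest training run relative to today (1.0 = unchanged)."""
    demand = growth_factor(months, compute_doubling_months)
    price_per_unit = 1 / growth_factor(months, cost_halving_months)
    return demand * price_per_unit

for months in (12, 24, 36):
    slow = relative_training_cost(months, cost_halving_months=24.0)
    fast = relative_training_cost(months, cost_halving_months=18.0)
    print(f"{months} months: {fast:.1f}x to {slow:.1f}x today's cost")
```

Under these assumptions the cost of the largest run grows about 7x in the first year even with the 24-month cost halving applied, which is the sense in which demand outstrips the hardware savings.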

It really irks me that these things are called "laws". A law is something we expect to hold true forever, by means of the hypothetico-deductive scientific method.

They're phenomena. They're patterns we observe, and that's it. The pattern may change anytime, and that's something that should be expected. The causes may be known or unknown, but to call it a law may even make it hold true for longer, for "psychological" reasons. The law of gravity isn't influenced by what SpaceX investors think about it.

Can you elaborate on why you think Moore's law is considered dead? It seems to me that for the computing hardware in question (GPUs and specialized ASICs, not consumer CPUs) we're still seeing steady improvements in transistors/$ and flops/$, and I expect that to continue for some time at least.

Yes, specialized hardware for AI is seeing steady improvements. I'm curious whether these improvements rely on the particulars of the algorithms running on these machines. As an example, several of the AI chips use lower-precision floating point numbers than general CPUs, since the algorithms in use for training neural networks don't need the higher precision.

I actually wonder if having specialized AI hardware isn't the same problem as having specialized AI models: in the short term it will improve efficiency, but in the long run it will prevent the discovery of newer general learning strategies, because they won't run faster on existing specialized hardware.
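
For a feel of what "lower precision" costs, here is a small sketch using only Python's standard `struct` module, which can round-trip a number through IEEE 754 half, single and double precision. IEEE half precision is used as a stand-in here; actual AI chips may use other 16-bit or 8-bit formats, and the helper names are made up for illustration.

```python
import struct

# struct format codes: 'e' = 16-bit half, 'f' = 32-bit single, 'd' = 64-bit double
FORMATS = {"float16": "<e", "float32": "<f", "float64": "<d"}

def round_trip(value, fmt):
    """Store `value` in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def relative_error(value, fmt):
    """Relative rounding error introduced by storing `value` in `fmt`."""
    return abs(round_trip(value, fmt) - value) / abs(value)

weight = 0.1  # an arbitrary example value
for name, fmt in FORMATS.items():
    print(name, round_trip(weight, fmt), relative_error(weight, fmt))
```

Half precision keeps only about three significant decimal digits, at a quarter of the storage of a 64-bit double. Whether that is enough depends on the algorithm, which is exactly the coupling between hardware and algorithms raised above.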

So I know Moore's law is "dead" (dead as in Cobol or dead as in Elvis?) and progress is definitely slower than it has been historically. However, we have only begun to leverage parallelization at scale from a software perspective, so I think we have some runway in that direction. And of course there's the elephant looming on the horizon: quantum computing.

Sure, it is in its infancy, but assuming the research continues to show that quantum computing is viable, I expect it to be an even bigger deal than the move from vacuum tubes to transistors. At that point we'll be dealing with an entirely different world in computing.

It's kind of poetic that the chief bottleneck of advancement in the field is now the physical universe.