My background is in NLP - I suspect we'll see something similar in language models to what we've seen in vision models. Consider this article[1] ("NLP's ImageNet moment has arrived"), which compares AlexNet in 2012 to the first GPT model six years later: we're just a few years behind.
True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly data- and compute-hungry. That's the 'tick' your second article mentions.
We simultaneously have people doing research in the 'tock' - reducing the compute needed. ICLR 2020 was full of alternative training schemes that required less compute for similar performance (e.g. ELECTRA[2]). Model distillation is another interesting idea that reduces the amount of inference-time compute needed.
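To make the distillation idea concrete, here is a minimal sketch of the soft-target loss commonly used for it, assuming the usual setup where a small student is trained to match a large teacher's temperature-softened output distribution. The function names and the default temperature are my own choices for illustration, not taken from any particular paper or library.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures (an assumption of this sketch).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a single 3-class example.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student agrees with teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees: larger loss
```

The inference-time saving comes from the student being a much smaller network; the loss itself only describes how the teacher's knowledge gets transferred during training.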
So the trend isn't changing: we still need bigger models to make progress in NLP and CV. The algorithmic efficiencies are promising, but they aren't giving anywhere near the same improvements as larger models.
I'm curious how long this trend will continue and whether there's anything promising that can reverse it.
IMHO the main thing that determines this trend is whether the results are good enough. For the most part, there's only some overlap between the people who work on better results and the people who work on more efficient results; those research directions are driven by different needs and thus also tend to happen in different institutions.
As long as our proof-of-concept solutions don't yet solve the task appropriately, as long as the solution is weak and/or brittle and worse than what we need for the main practical applications, most of the research focus - and the research progress - will be on models that try to give better results. It makes sense to disregard the compute cost and other practical inconveniences when working on pushing the bleeding edge, trying to make previously impossible things possible.
However, when tasks are "solved" from the academic proof-of-concept perspective, then generally the practical, applied work on model efficiency can get huge reductions in computing power required. But that happens elsewhere.
The concept of technology readiness level (https://en.wikipedia.org/wiki/Technology_readiness_level) is relevant. For the NLP and CV technologies that are at TRL 3 or 4, efficiency does not really matter as long as the model fits in whatever computing clusters you can afford. It mainly becomes an issue for widespread adoption in industry, by the time the same tech is at TRL 6 or so, and that work mostly gets done by different people in different organizations with different funding sources than the initial TRL 3 research.
[1] https://thegradient.pub/nlp-imagenet/
[2] https://openreview.net/pdf?id=r1xMH1BtvB