I don't recall the Chinchilla paper disputing my point. They establish "training-compute optimal" scaling laws, but none of their findings suggest that loss hits any kind of asymptote.
Perhaps we're talking past each other. Is "loss threshold" a specific term in the LLM literature?
I'm merely pointing out that the debate over whether we are compute-limited or data-limited (OP) has not concluded at all; there are lots of compelling theories on the relationship between the two.