by barelyauser
649 days ago
I find that researchers' choices of names for the sake of differentiation are more of a barrier than a help. Sometimes it feels like I know nothing, but in reality it is the name of the "technique" or phenomenon that does not get parsed by my brain. Things like "Compute-Optimal Sampling" sound just like any other made-up gibberish that may or may not exist. Wordings like "memory-centric subsampling", "search based hyper space modeling", "locally induced entropy optimization" don't get parsed. And more often than not, after reading such papers, I've come to find out that it is a fancy name for something a toddler knows about. Really disappointing.
|
Of course there are some (possibly many!) papers where jargon is abused to make something sound smarter. Sometimes this can also happen unintentionally.
In this case, "compute-optimal X" is standard terminology in large-scale ML model design for finding the best tradeoff with regard to compute when trying to achieve X.
Here, the paper is about finding the optimal model size tradeoff when training on LLM-generated synthetic data. Imagine you have a class of LLMs, from small to infinitely large. The larger the LLM, the higher the quality of your synthetic data, but you will also spend more compute to generate this data ("sampling" the data). Smaller LLMs can generate more data with the same compute budget, but at worse quality.
The paper runs some experiments and finds that, in their case, you don't always want the largest possible LLM for synthetic data (as many practitioners previously thought). Instead, you can get further by making more calls to a smaller but worse LLM.
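A rough sketch of the budget arithmetic behind that tradeoff, not the paper's actual setup. It assumes the common approximation that generating one token costs about 2 FLOPs per model parameter; the model sizes, the sample length, and the function name are all hypothetical, picked only to make the numbers easy to follow.

```python
# Illustrative only: sizes and sample length are made up, not from the paper.
SMALL_PARAMS = 9_000_000_000    # hypothetical "small" model
LARGE_PARAMS = 27_000_000_000   # hypothetical "large" model
TOKENS_PER_SAMPLE = 512         # hypothetical length of one generated sample


def samples_for_budget(budget_flops: int, params: int, tokens_per_sample: int) -> int:
    """Number of samples a model can generate within a fixed FLOP budget.

    Assumes generation costs roughly 2 * params FLOPs per token.
    """
    flops_per_sample = 2 * params * tokens_per_sample
    return budget_flops // flops_per_sample


# Fix the budget at whatever it costs the large model to produce 1000 samples.
budget = 2 * LARGE_PARAMS * TOKENS_PER_SAMPLE * 1000

large_samples = samples_for_budget(budget, LARGE_PARAMS, TOKENS_PER_SAMPLE)
small_samples = samples_for_budget(budget, SMALL_PARAMS, TOKENS_PER_SAMPLE)

print(f"large model: {large_samples} samples")
print(f"small model: {small_samples} samples")
```

Under these assumptions the smaller model gets three times as many samples for the same compute, simply because it has a third of the parameters. Whether that extra quantity makes up for the lower quality per sample is the empirical question the paper tests.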