by pama
781 days ago
I agree with almost all you said except that Twitter is better than top conferences, and I take a contrarian view that reviewers slow down AGI with requests for additional experiments. Without going into specifics, which you can probably guess based on your background, too many ideas that work well, even optimally, at small scale fail horribly at large scale. Other ideas that work at super specialized settings don’t transfer or don’t generalize. The saving of two or three dimensions for exact symmetry operations is super important when you deal with a handful of dimensions, and is often irrelevant, or even slows training down a lot, when you already deal with tens of thousands of dimensions. Correlations in huge multimodal datasets are way more complicated than most humans can grasp, and we will not get to AGI before we have a large enough group of people dealing with such data routinely. It is very likely detrimental to our progress toward AGI that we lack abundant hardware for academics and hobbyists to contribute frontier experiments; however, we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences. This particular work listed on HN stands out despite the lack of scaling and will probably make it into a top conference (perhaps with some additional background citations), but not everything that is merely novel should simply make it to ICLR or NeurIPS or ICML; otherwise we could have a million papers in each a few years from today and nobody would be the wiser.
Not that I disagree, but I don't think that's a reason to not publish. There's another way to rephrase what you've said
But this is true for many works, even transformers. You don't just scale by turning up model parameters and data. You can, but generally more things are going on. So why hold these works back because of that? There may be nuggets in there that may be of value and people may learn how to scale them. Just because they don't scale (now or ever) doesn't mean they aren't of value (and let's be honest, if they don't scale, this is a real killer for the "scale is all you need" people)
> Other ideas that work at super specialized settings don’t transfer or don’t generalize.
It is also hard to tell whether these failures just come down to hyper-parameter settings. Not that I disagree with you, but it is hard to tell.
> Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.
I'm not sure I understand your argument here. The people I know who work at scale often have the worst understanding of large data: not understanding the difference between density in a normal distribution and in a uniform one, thinking that LERPing in a normal yields representative data, or misunderstanding cosine similarity and orthogonality. IME people who work at scale benefit from being able to throw compute at problems.
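To make the geometry concrete, here is a rough sketch of how I read those three points, not anything taken from a paper. It assumes a standard normal in 10,000 dimensions, and the helper names (`norm`, `gaussian_geometry`) are made up for illustration. Samples from a high-dimensional normal sit on a thin shell of radius about sqrt(d), the LERP midpoint of two samples has norm about sqrt(d/2) and so falls well inside that shell, and two independent samples are close to orthogonal.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def gaussian_geometry(dim=10_000, seed=0):
    """Compare two standard-normal samples and their LERP midpoint.

    dim and seed are illustrative choices, not values from the thread.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Linear interpolation at t = 0.5.
    mid = [0.5 * (x + y) for x, y in zip(a, b)]

    # Cosine similarity between the two independent samples.
    cosine = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

    return {
        "shell_radius": math.sqrt(dim),   # where typical samples live
        "norm_a": norm(a),
        "norm_b": norm(b),
        "norm_mid": norm(mid),            # expected near sqrt(dim / 2)
        "cosine": cosine,                 # expected near 0
    }


stats = gaussian_geometry()
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The midpoint lands at roughly 70% of the typical radius, which is a region the normal almost never samples from. That is the sense in which a plain LERP is not representative, and it is why people often reach for spherical interpolation instead.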
> we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences
You and I have very different ideas as to what constitutes information gain. I would say having a majority of people study two models (LLMs and diffusion) results in less gain, not more.
And as I've said above, I don't care about novelty. It's a meaningless term. (and I wish to god people would read the fucking conference reviewer guidelines as they constantly violate them when discussing novelty)