| HN Mirror


by cs702 822 days ago

It's so refreshing to come across new AI research different from the usual "we modified a transformer in this and that way and got slightly better results on this and that benchmark." All those new papers proposing incremental improvements are important, but... everyone is getting a bit tired of them. Also, anecdotal evidence and recent work suggest we're starting to run into fundamental limits inherent to transformers, so we may well need new alternatives.[a]

The best thing about this new work is that it's not an either/or proposition. The proposed "learnable spline interpolations as activation functions" can be used in conventional DNNs, to improve their expressivity. Now we just have to test the stuff to see if it really works better.
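To make "learnable interpolation as an activation function" concrete, here is a minimal sketch. It is an assumption-laden simplification, not the paper's method: it uses piecewise-linear interpolation on a fixed grid instead of splines, and the class name, knot count, and `fit` helper are hypothetical names chosen for illustration.

```python
import numpy as np

class PiecewiseLinearActivation:
    """Activation defined by learnable values at fixed knots (illustrative only)."""

    def __init__(self, num_knots=9, lo=-2.0, hi=2.0):
        self.knots = np.linspace(lo, hi, num_knots)  # fixed grid
        self.values = np.maximum(self.knots, 0.0)    # learnable, initialised to ReLU

    def __call__(self, x):
        # Linear interpolation between knot values; inputs outside
        # [lo, hi] are clamped to the end values by np.interp.
        return np.interp(x, self.knots, self.values)

    def grad_values(self, x, upstream):
        """Gradient of sum(upstream * f(x)) with respect to the knot values."""
        x = np.clip(np.ravel(x), self.knots[0], self.knots[-1])
        upstream = np.ravel(upstream)
        # Index of the knot to the left of each input.
        idx = np.clip(np.searchsorted(self.knots, x, side="right") - 1,
                      0, len(self.knots) - 2)
        left, right = self.knots[idx], self.knots[idx + 1]
        t = (x - left) / (right - left)  # position inside the segment
        grad = np.zeros_like(self.values)
        np.add.at(grad, idx, upstream * (1.0 - t))
        np.add.at(grad, idx + 1, upstream * t)
        return grad

def fit(act, xs, target, lr=2.0, steps=500):
    """Plain gradient descent on mean squared error (hypothetical helper)."""
    for _ in range(steps):
        err = act(xs) - target
        act.values -= lr * act.grad_values(xs, 2.0 * err / xs.size)
    return act

xs = np.linspace(-2.0, 2.0, 200)
act = fit(PiecewiseLinearActivation(), xs, np.tanh(xs))
print("MSE vs tanh:", np.mean((act(xs) - np.tanh(xs)) ** 2))
```

The shape of the nonlinearity becomes a set of trainable parameters, so the same unit can start as a ReLU and drift toward whatever shape the loss prefers. Whether this helps in a conventional DNN is exactly the open question raised above.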

Very nice. Thank you for sharing this work here!

---

[a] https://news.ycombinator.com/item?id=40179232

3 comments

godelski 822 days ago

There's a ton actually. They just tend to go through extra rounds of review (or never make it...) and never make it to HN unless there are special circumstances (this one is MIT and CIT). Unfortunately we've let PR become a very powerful force (it's always been a thing, but seems more influential now). We can fight against this by upvoting things like this and, if you're a reviewer, not focusing on SOTA (it's clearly been gamed and is clearly leading us in the wrong direction).

abhgh 821 days ago

Yes seconding this. If you want a broad view of ML IMHO the best places to look at are conference proceedings. The typical review process is imperfect so that still doesn't show you all the interesting work out there (which you mention), but it is still a start wrt diversity of research. I follow LLMs closely but then going through proceedings means I come across exciting research like these [1],[2],[3].

References:

[1] A grad.-based way to optimize axis-parallel and oblique decision trees: the Tree Alternating Optimization (TAO) algorithm https://proceedings.neurips.cc/paper_files/paper/2018/file/1.... An extension was the softmax tree https://aclanthology.org/2021.emnlp-main.838/.

[2] XAI explains models, but can you recommend corrective actions? FACE: Feasible and Actionable Counterfactual Explanations https://arxiv.org/pdf/1909.09369, Algorithmic Recourse: from Counterfactual Explanations to Interventions https://arxiv.org/pdf/2002.06278

[3] OBOE: Collaborative Filtering for AutoML Model Selection https://arxiv.org/abs/1808.03233

godelski 821 days ago

Honestly, these days I just rely on arxiv. The conferences are so noisy that it is hard to really tell what's useful and what's crap. Twitter is a bit better but still a crap shoot. So as far as it seems to me, there's no real good signal to use to differentiate. And what's the point of journals/conferences if not to provide some reasonable signal? If it is a slot machine, it is useless.

And I feel like we're far too dismissive of instances we see where good papers get rejected. We're too dismissive of the collusion rings. What am I putting in all this time to write and all this time to review (and be an emergency reviewer) if we aren't going to take some basic steps forward? Fuck, I've saved a Welling paper from rejection from two reviewers who admitted to not knowing PDEs, and this was a workshop (should have been accepted into the main conference). I think review works for those already successful, who can p̶a̶y̶ "perform more experiments when requested" their way out of review hell, but we're ignoring a lot of good work simply for lack of m̶o̶n̶e̶y̶ compute. It slows down our progress to reach AGI.

pama 821 days ago

I agree with almost all you said except that Twitter is better than top conferences, and I take a contrarian view on whether reviewers slow down AGI with requests for additional experiments. Without going into specifics, which you can probably guess based on your background, too many ideas that work well, even optimally, at small scale fail horribly at large scale. Other ideas that work in super specialized settings don’t transfer or don’t generalize. The saving of two or three dimensions for exact symmetry operations is super important when you deal with a handful of dimensions, and is often irrelevant or slows down training a lot when you already deal with tens of thousands of dimensions. Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely. It is very likely detrimental for our progress to AGI that we lack abundant hardware for academics and hobbyists to contribute frontier experiments; however, we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences. This particular work listed in HN stands out despite lack of scaling and will probably make it into a top conference (perhaps with some additional background citations), but not everything that is merely novel should simply make it to ICLR or NeurIPS or ICML, otherwise we could have a million papers in each in a few years from today and nobody would be the wiser.

godelski 821 days ago

> too many ideas that work well, even optimally, at small scale fail horribly at large scale.

Not that I disagree, but I don't think that's a reason to not publish. There's another way to rephrase what you've said

  many ideas that work well at small scales do not trivially work at large scales

But this is true for many works, even transformers. You don't just scale by turning up model parameters and data. You can, but generally more things are going on. So why hold these works back because of that? There may be nuggets in there that may be of value and people may learn how to scale them. Just because they don't scale (now or ever) doesn't mean they aren't of value (and let's be honest, if they don't scale, this is a real killer for the "scale is all you need" people)

> Other ideas that work at super specialized settings don’t transfer or don’t generalize.

It is also hard to tell if these are hyper-parameter settings. Not that I disagree with you, but it is hard to tell.

> Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.

I'm not sure I understand your argument here. The people I know that work at scale often have the worst understanding of large data. Not understanding the differences between density in a normal distribution and a uniform. Thinking that LERPing in a normal yields representative data. Or cosine similarity and orthogonality. IME people that work at scale benefit from being able to throw compute at problems.
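One reading of the LERP point, sketched under assumptions: in high dimensions, samples from a standard normal concentrate in a thin shell of radius about sqrt(d), so the straight-line midpoint of two independent samples has a norm close to sqrt(d/2) and falls well inside that shell. Spherical interpolation (slerp) is a common alternative that keeps the norm typical. The dimension and seed below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # latent dimension (arbitrary choice for the demo)
a, b = rng.standard_normal(d), rng.standard_normal(d)

def lerp(a, b, t):
    """Straight-line interpolation."""
    return (1 - t) * a + t * b

def slerp(a, b, t):
    """Spherical interpolation along the arc between a and b."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

typical_norm = np.sqrt(d)  # norms of N(0, I) samples concentrate near sqrt(d)
lerp_ratio = np.linalg.norm(lerp(a, b, 0.5)) / typical_norm
slerp_ratio = np.linalg.norm(slerp(a, b, 0.5)) / typical_norm
print(f"lerp midpoint norm / sqrt(d):  {lerp_ratio:.3f}")
print(f"slerp midpoint norm / sqrt(d): {slerp_ratio:.3f}")
```

The lerp ratio comes out near 1/sqrt(2) while the slerp ratio stays near 1. For a uniform distribution over a cube the midpoint of two samples is still inside the support, which is one way the two densities behave differently.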

> we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences

You and I have very different ideas as to what constitutes information gain. I would say a majority of people studying two models (LLMs and diffusion) results in lower gain, not more.

And as I've said above, I don't care about novelty. It's a meaningless term. (and I wish to god people would read the fucking conference reviewer guidelines as they constantly violate them when discussing novelty)

pama 821 days ago

I think information gain will be easy to measure in principle with an AI in the near future: if the work is correct, how unexpected is it? Anything trivially predictable based on published literature, including exact reproduction disguised as novel, is not worthy of too much attention. Anything that has a chance of changing the model of the world is important. It can seem minor, even trivial, to some nasty reviewer, but if the effect is real and not demonstrated before then it deserves attention. Until then, we deal with imperfect humans.

Regarding large multimodal data, I don’t know what people you refer to, so I can’t comment further. The current math is useful but very limited when it comes to understanding the densities in such data; vectors are always orthogonal at high dim and densities are always sampled very poorly. The type of understanding of data that would help progress in drug and material design, say, is very different from the type of data that can help a chatbot code. Obviously the future AI should understand it all, but it may take interdisciplinary collaborations that best start at an early age and don’t fit the current academic system very well unfortunately.
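The "always orthogonal at high dim" remark can be checked numerically for the case of independent random vectors, which is the assumption made here (correlated vectors such as learned embeddings need not behave this way). For independent Gaussian vectors the cosine similarity has standard deviation 1/sqrt(dim), so it shrinks toward zero as the dimension grows. The function name and sample sizes are illustrative.

```python
import numpy as np

def cosine_stats(dim, pairs=500, seed=0):
    """Mean |cos| and std of cos between independent Gaussian vector pairs."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((pairs, dim))
    v = rng.standard_normal((pairs, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return float(np.mean(np.abs(cos))), float(np.std(cos))

for dim in (2, 32, 512, 4096):
    mean_abs, std = cosine_stats(dim)
    print(f"dim={dim:5d}  mean|cos|={mean_abs:.3f}  std={std:.3f}  "
          f"1/sqrt(dim)={dim ** -0.5:.3f}")
```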

abhgh 821 days ago

Yes arxiv is a good first source too. I mentioned conferences as a way to get exposed to diversity, but not necessarily (sadly) merit. It has been my experience as an author and reviewer both that review quality has plummeted over the years for the most part. As a reviewer I had to struggle with the ills of "commission and omission" both, i.e., (a) convince other reviewers to see an idea (from a trendy area such as in-context learning) as not novel (because it has been done before, even in the area of LLMs), and (b) see an idea as novel, which wouldn't have seemed so initially because some reviewers weren't aware of the background or impact of anything non-LLM, or god forbid, non-DL. As an author this has personally affected me because I had to work on my PhD remotely, so I didn't have access to a bunch of compute and I deliberately picked a non-DL area, and I had to pay the price for that in terms of multiple rejections, reviewer ghosting, journals not responding for years (yes, years).

godelski 821 days ago

I've stopped considering novelty at all. The only thing I now consider is if the precise technique has been done before. If not, well I've seen pretty small things change results dramatically. The pattern I've seen that scares me more is that when authors do find simple but effective changes, they end up convoluting the ideas because simplicity and clarity is often confused with novelty. And honestly, revisiting ideas is useful as our environments change. So I don't want to discourage this type of work.

Personally, this has affected me as a late PhD student. Late in the literal sense as I'm not getting my work pushed out (even some SOTA stuff) because of factors like these and my department insists something is wrong with me but will not read my papers, the reviews, or suggest what I need to do besides "publish more." (Literally told to me, "try publishing 5 papers a year, one should get in.") You'll laugh at this, I pushed a paper into a workshop and a major complaint was that I didn't give enough background on StyleGAN because "not everyone would be familiar with the architecture." (while I can understand the comment, 8 pages is not much room when you gotta show pictures on several datasets. My appendix was quite lengthy and included all requested information). We just used a GAN as a proxy because diffusion is much more expensive to train (most common complaints are "not enough datasets" and "how's it scale"). I think this is the reason so many universities use pretrained networks instead of training things from scratch, which just railroads research.

(I also got a paper double desk rejected. First because it was "already published." Took 2 months for them to realize it was arxiv only. Then they fixed that and rejected again because "didn't cite relevant works" with no mention of what those works were... I've obviously lost all faith in the review process)

pama 821 days ago

Sorry to hear all this (after writing my other sibling comment). Please don’t lose faith in the review process. It is still useful. Until the AGI can be better reviewers, which is hopefully not too far in the future.

abhgh 821 days ago

Sorry to hear that. My experiences haven't been very different. I really can't tell if the current review process is the least bad among alternatives or is there something better (if so, what is it?).

versteegen 821 days ago

> I've saved a Welling paper from rejection from two reviewers who admitted to not knowing PDEs

Thank you for fighting the good fight.

This is why I love OpenReview: I can spot and ignore nonsensical reviewer criticisms and ratings and look for the insightful comments and rebuttals. Many reviewers do put in a lot of very valuable work reading and critiquing, most of which would go to waste if not made public.

godelski 821 days ago

I like OR too and I wish we would just post to there instead. It has everything we need, and I see no value from the venues. No one wants to act in good faith and they have every incentive not to.

And I gotta say, I'm not going to put up a fight much longer. As soon as I get out of my PhD I intend to just post to OR.

cs702 821 days ago

> never make it to HN unless there's special circumstances

Yes, I agree. The two most common patterns I've noticed in research that does show up on HN are: 1) It outright improves, or has the potential to improve, applications currently used in production by many HN readers. In other words, it's not just navel-gazing. 2) The authors and/or their organizations are well-known, as you suggest.

godelski 821 days ago

What bothers me the most is that comments will float to the top of a link that's an arxiv paper or uni press where people will talk about how something is still in a prototype stage and not production yet/has a ways to go to production. While this is fine, that's also the context of works like these. But it is the same thing that I see in reviews. I've had works myself killed because reviewers treat the paper as a product rather than... you know... research.

sevagh 821 days ago

For example, I find Spiking Neural Networks to be cool, but until they reach SOTA, how can they displace conventional neural networks?

godelski 821 days ago

Compare how much time has been spent studying the two different architectures. Who knows if SNNs can displace other stuff, but I wouldn't rely on SOTA for being the benchmark. Progress has to be made and it isn't made in leaps and bounds. If you find them cool, study them more. Maybe you'll stumble onto something. Maybe you'll find an edge in a niche domain (and maybe you find that that edge can generalize more than you initially thought).

Stop worrying about displacing conventional networks and start worrying about understanding things. We chip away at this together, as a community. There's a lot we need to learn and a lot that needs to be explored. Why tie anyone's hands behind their backs?

sevagh 821 days ago

I'm no stranger to having written papers that follow my own curiosity that didn't show any promising results.

However, I wouldn't blame "the community" for not taking my idea and building on it. There needs to be a seed of hope, a taste of future benefits, or else why is it anybody's obligation to care about something subpar?

The introducer of a novel idea needs to beat the incumbent by a large margin. This is just reality, not injustice.

samus 821 days ago

The incumbent approaches usually benefit from a ton of research that might or might not be transferable to the newcomer.

Even if many optimizations also apply to the new approaches, taking advantage of them takes a lot of work. For example, I have not yet implemented KV caches for my nanoGPTs that I'm fooling around with.
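For readers unfamiliar with the optimization mentioned above, here is a minimal sketch of a KV cache for single-head causal attention in NumPy. It is not nanoGPT's code; the class and function names, shapes, and weights are hypothetical. The idea is that keys and values of earlier tokens are stored, so each decoding step only computes projections for the newest token.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    """Full recomputation: every position attends over its whole prefix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future tokens
    return softmax(scores) @ v

class KVCache:
    """Stores keys/values of tokens seen so far (illustrative only)."""

    def __init__(self):
        self.k, self.v = [], []

    def step(self, x_t, Wq, Wk, Wv):
        # Project only the new token; earlier keys/values are reused.
        self.k.append(x_t @ Wk)
        self.v.append(x_t @ Wv)
        k, v = np.stack(self.k), np.stack(self.v)
        scores = (x_t @ Wq) @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores) @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))  # 6 tokens, model width 8
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))

full = causal_attention(x, Wq, Wk, Wv)
cache = KVCache()
incremental = np.stack([cache.step(x_t, Wq, Wk, Wv) for x_t in x])
print("max abs difference:", np.abs(full - incremental).max())
```

Both paths produce the same outputs; the cached path avoids recomputing projections for the prefix at every step. Re-stacking the lists each step keeps the sketch short, whereas a real implementation would typically preallocate buffers, which is part of the porting work the comment refers to.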

godelski 820 days ago

> The introducer of a novel idea needs to beat the incumbent by a large margin. This is just reality, not injustice.

It is an injustice and an impediment to scientific progress.

It is also a very odd thing to see in any technological progress. This is not a normal process btw. Generally we see S-curves and the newer technology is initially worse. That should be unsurprising given that it has had far less time and far less attention. You have to look at the potential and see if things are worth pursuing. We should not expect that to be carried by one team. If we do, we'll only have the lucky, crazy, and the big leading. That's not a great thing for science, especially if we want to claim that it is on the merit of ideas, not status.

beagle3 822 days ago

I read a book on NNs by Robert Hecht-Nielsen in 1989, during the NN hype of the time (I believe it was the 2nd hype cycle, the first beginning with Rosenblatt’s original hardware perceptron and dying with Minsky and Papert’s “Perceptrons” manuscript a decade or two earlier).

Everything described was laughably basic by modern standards, but the motivation given in that book was the Kolmogorov representation theorem: a modest 3-layer network with the right activation functions can represent any continuous m-to-n function.
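For reference, the theorem in its usual form (often called the Kolmogorov-Arnold representation theorem) says that any continuous function $f : [0,1]^n \to \mathbb{R}$ can be written using only continuous univariate functions and addition. The symbols $\Phi_q$ and $\varphi_{q,p}$ are the conventional names, not notation taken from the book mentioned above:

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)
```

Read as a network, the inner sums play the role of $2n+1$ hidden units and the outer sum is the output layer; an m-to-n map applies the statement to each output component separately. The theorem only guarantees that such functions exist, and they can be far from smooth, which is one reason it served as motivation rather than as a recipe.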

Most research back then focused on 3-layer networks, possibly for that reason. Sigmoid activation was king, and vanishing gradients the main issue. It took 2 decades until AlexNet brought NN research back from the AI winter of the 1990’s.

glebnovikov 822 days ago

> Everyone is getting tired of those papers.

This is science as is :)

95% will produce mediocre-to-nice improvements to what we already have, so that there are researchers who eventually grow up and do something really exciting.

godelski 822 days ago

Nothing wrong with incremental improvements. Giant leaps (almost always) only look like giant leaps because you lack the niche domain expertise. And I mean niche niche.