	by samsartor 281 days ago
	I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield. Reconsider the auto-regressive task, cross entropy loss, even gradient descent optimization itself.

4 comments

kingstnap 281 days ago

There are many many problems with attention.

The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general, a decision boundary built on Euclidean dot products isn't optimal for everything; there are many classes of problems where you want polyhedral cones [3]. Positional embeddings are also janky af, and so is RoPE tbh. I think Canon layers are a more promising alternative for horizontal alignment [4].

I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is coming up with synthetic tasks that mathematically should not work well, and proving that current architectures struggle with them. A great example of this is the "Transformers need glasses" paper [5], or belief state transformers with their star task [6]. The Google one about the limits of embedding dimensions is also great, and shows how the dimension of the QK part is actually important to getting good retrieval [7].

[1] https://arxiv.org/abs/2309.17453

[2] https://arxiv.org/abs/2410.01104

[3] https://arxiv.org/abs/2505.17190

[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330

[5] https://arxiv.org/abs/2406.04267

[6] https://arxiv.org/abs/2410.23506

[7] https://arxiv.org/abs/2508.21038

ACCount37 281 days ago

If all your problems with attention are actually just problems with softmax, then that's an easy fix. Delete softmax lmao.

No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
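
A minimal sketch of the "parking spot" idea, assuming it means one extra logit in the softmax denominator whose share of the weight is then thrown away. The function names and `sink_logit` are illustrative, not taken from GPT-OSS; in a real model the sink value would be a learned parameter.

```python
import math

def softmax(scores):
    """Plain softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_sink(scores, sink_logit=0.0):
    """Softmax with one extra "parking spot" logit in the denominator.

    The sink's share of the mass is computed and then dropped, so the
    weights over the real tokens can sum to less than 1.
    sink_logit is a hypothetical name; assume it is learned in practice.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    total = sum(exps) + sink
    return [e / total for e in exps]

# Three keys, none of which matches the query well.
scores = [-8.0, -9.0, -8.5]
plain = softmax(scores)
parked = softmax_with_sink(scores, sink_logit=0.0)
print(sum(plain))   # forced to spread a full unit of attention
print(sum(parked))  # almost all of the mass sits on the sink instead
```

When one score clearly dominates the sink logit, the two versions give nearly the same weights, so the sink only matters when nothing is worth attending to.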

The reason why we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way we can sift through this gajillion "revolutionary new architecture ideas" and get at the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains once you account for compute not being unlimited. Especially not when it comes to frontier training runs.

Memorization vs generalization is a well known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.

thousand_nights 280 days ago

reason we're benchmaxxing is because there's a huge monetary incentive now to have the best performing model on these synthetic benchmarks and that status is worth a lot of money

literally every new release of something point X model of every major player includes some benchmark graphs to show off

mycall 280 days ago

benchmaxxing has also been identified as one of the causes of hallucination.

svnt 280 days ago

hallucination is just built in, what am I missing?

ACCount37 280 days ago

That LLMs have some basic metaknowledge and metacognitive skills that they can use to reduce the hallucination rate.

Which is what humans do too - it's not magic. Humans just get more metacognitive juice for free. Resulting in a hallucination rate significantly lower than that of LLMs, but significantly higher than zero.

Now, having the skills you need to avoid hallucinations is good, even if they're weak and basic skills. But is an LLM willing to actually put them to use?

OpenAI cooked o3 with reckless RL using hallucination-unaware reward calculation - which punished reluctance to answer and rewarded overconfident guesses. And their benchmark suite didn't catch it, because the benchmarks were hallucination-unaware too.
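
To make the incentive concrete, here is a toy grading rule. It is an illustration only, not OpenAI's actual reward: a correct answer scores +1, a wrong answer costs `wrong_penalty`, and abstaining scores 0. All names here are made up for the example.

```python
def expected_score(p_correct, wrong_penalty):
    """Expected score of answering when the model is right with probability p_correct.

    Correct answers score +1, wrong answers score -wrong_penalty,
    and abstaining ("I don't know") always scores 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_answer(p_correct, wrong_penalty):
    """Answer only if guessing beats the 0 earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0

for p in (0.1, 0.3, 0.6):
    unaware = should_answer(p, wrong_penalty=0.0)  # wrong answers cost nothing
    aware = should_answer(p, wrong_penalty=1.0)    # wrong answers cost as much as right ones earn
    print(p, unaware, aware)
```

With no penalty, guessing beats abstaining at any nonzero confidence, so a policy trained on that score learns to always guess. With a penalty of 1, answering only pays off above 50% confidence; in general the break-even point is `wrong_penalty / (1 + wrong_penalty)`.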

skissane 280 days ago

> Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that

Not familiar with this topic, but intrigued. Anywhere I can read more about it?

ACCount37 280 days ago

Looked for it briefly, think the best I got is this older discussion:

https://news.ycombinator.com/item?id=44834918

qcnguy 280 days ago

OpenAI have talked about it. The neural architecture needs to let the model handle the case where there's nothing worth attending to: softmax forces the attention weights to sum to 1, so some token always has to receive attention even when none deserves it.

mxkopy 280 days ago

I agree, gradient descent implicitly assumes things have a meaningful gradient, which they don’t always. And even if we say anything can be approximated by a continuous function, we’re learning we don’t like approximations in our AI. Some discrete alternative of SGD would be nice.
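
A small sketch of the "no meaningful gradient" case, assuming a hard-threshold unit with a 0/1 loss. The setup (`step_loss`, the 0.5 threshold, the learning rate) is invented for illustration.

```python
def step_loss(w, x=1.0, target=1.0):
    """0/1 loss of a hard-threshold unit: predict 1 if w * x > 0.5, else 0."""
    prediction = 1.0 if w * x > 0.5 else 0.0
    return (prediction - target) ** 2

def numeric_grad(f, w, eps=1e-4):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.1
for _ in range(100):
    w -= 0.1 * numeric_grad(step_loss, w)  # gradient is 0 here, so w never moves
print(w, step_loss(w))
```

The loss is flat everywhere except at the jump, so gradient descent gets no signal even though moving `w` past 0.5 would drive the loss to zero. This is the kind of problem where a discrete search or a smooth surrogate is usually substituted.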

eldenring 281 days ago

I think something with more uniform training and inference setups, and otherwise equally hardware friendly, just as easily trainable, and equally expressive could replace transformers.

BDH

Yeah that thing is quite interesting - baby dragon hatchling https://news.ycombinator.com/item?id=45668408 https://youtu.be/mfV44-mtg7c
