| HN Mirror



	by digdugdirk 660 days ago
	Is there any way to replicate this with existing models, or are we going to need to wait for models to be trained in this style? I'm imagining a smaller model examining the output tokens of a larger model and metaphorically slapping it on the wrist with a ruler if the output tokens start drifting off topic. Not quite the same, but an entertaining thought nonetheless.

2 comments

bionhoward 660 days ago

Yes, I believe this is possible. You could clone the weights of one or more existing models and fine-tune them in groups, with different random seeds for noise/dropout, so that they produce reasonable outputs under a differential transformer decoding scheme in which tokens with disagreement receive more attention (surprisal analysis).
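
To make the "tokens with disagreement" idea concrete, here is a toy sketch. It is not from the paper, only one way the disagreement signal might be measured: given next-token distributions from several fine-tuned copies, compute the Jensen-Shannon divergence at each position and flag the positions where the copies disagree. The function names, the made-up distributions, and the 0.05 threshold are all assumptions for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence across several distributions.

    Zero when all distributions are identical; grows as they disagree.
    """
    n = len(dists)
    mean = [sum(col) / n for col in zip(*dists)]
    return entropy(mean) - sum(entropy(d) for d in dists) / n

def flag_disagreements(per_position, threshold=0.05):
    """Return indices of positions whose divergence exceeds the threshold."""
    return [i for i, dists in enumerate(per_position)
            if js_divergence(dists) > threshold]

# Hypothetical next-token distributions over a 3-token vocabulary:
# two positions, two model copies at each position.
per_position = [
    [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]],    # copies agree
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],  # copies disagree
]
print(flag_disagreements(per_position))
```

What to do with the flagged positions (re-weighting, re-sampling, deferring to a larger model) is left open here, since the comment does not spell it out.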

causal 660 days ago

It's a different attention mechanism with a different attention-map setup, so it's fundamentally a different type of model.

om8 660 days ago

Looks like it is a drop-in replacement for attention, but yes, models will need to be retrained for this one.
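
For anyone wondering what changes inside the attention block, here is a minimal pure-Python sketch of the core idea as I understand it: two softmax attention maps are computed from two query/key projections, and the second is subtracted from the first (scaled by a factor lambda) before multiplying by the values. This is simplified on purpose. It assumes a single head, no causal mask, a fixed scalar `lam` rather than a learned one, and it skips the per-head normalization the paper applies. The function names are mine.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = math.sqrt(len(q[0]))
    return [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
            for qi in q]

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    """Differential attention: (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) v.

    lam is a fixed scalar here (an assumption); in a real model it would be learned.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for r1, r2 in zip(a1, a2):
        weights = [x - lam * y for x, y in zip(r1, r2)]
        out.append([sum(w * row[c] for w, row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Tiny made-up example: 2 tokens, head dimension 2.
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 1.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(diff_attention(q1, k1, q2, k2, v))
```

With `lam=0` this reduces to ordinary softmax attention over the first projection, which is why it reads as a drop-in replacement. The second map only helps once training has shaped it, which matches the point about retraining.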

aDyslecticCrow 660 days ago

It may not need to be entirely retrained. The value spans and input are the same, and no extra weights are needed. You may be able to tune an existing model with this attention mechanism and get some of the benefits.

But overall... it's mainly a training change, so training is needed to make a difference.
