	by vessenes 850 days ago
	I wasn't sure on reading the abstract whether this paper was parody. It's not. Two things stand out to me. The first is the idea of distilling these networks down into a smaller latent space and then mucking around with that. That's interesting, and it cuts across a bunch of interesting topics like interpretability, compression, training, over- and under-.. The second is that they show the diffusion models don't just converge on parameters similar to the ones they train against/diffuse into, and that's also interesting. I confess I'm not sure what I'd do with this given the random grab bag of deep learning knowledge I have, but I think it's pretty fascinating. I'd like to see a trained latent encoder that works well on a bunch of different neural networks; maybe that would be a good tool for interpreting and inspecting them.

4 comments

daxfohl 850 days ago

Seems like it could be useful for resizing networks, no? Start with ChatGPT 4, then release an open version of it with far fewer parameters.

Or maybe some metaparameter that mucks with the sizes during training produces better results. Start large to get a baseline, then reduce size to increase coherence and learning speed, then scale up again once that is maxed out.

SubiculumCode 850 days ago

Perhaps doing this to generate 10 similar but different versions of a model can then be fed into mixture of experts?
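
A minimal sketch of that idea, under loud assumptions: the "similar but different versions" are faked here by adding Gaussian noise to one set of weights (a stand-in for sampling from the parameter diffusion model), each expert is a toy linear model, and `make_variants`, `moe_forward`, and the gate weights are hypothetical names invented for illustration.

```python
import math
import random

def make_variants(base_weights, n=10, noise=0.05, seed=0):
    # Stand-in for sampling n parameter sets from a generative model:
    # here we just jitter the base weights with Gaussian noise.
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise) for w in base_weights] for _ in range(n)]

def dot(weights, x):
    # Toy "network": a single linear unit.
    return sum(w * xi for w, xi in zip(weights, x))

def softmax(scores):
    # Numerically stable softmax over the gate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(experts, gate_weights, x):
    # Dense mixture of experts: the gate scores each expert for input x,
    # and the output is the gate-weighted sum of the expert outputs.
    probs = softmax([dot(g, x) for g in gate_weights])
    return sum(p * dot(e, x) for p, e in zip(probs, experts))

base = [0.5, -1.0, 2.0]
experts = make_variants(base, n=10)
gates = [[0.0, 0.0, 0.0] for _ in experts]  # untrained gate: uniform mixing
y = moe_forward(experts, gates, [1.0, 2.0, 3.0])
```

With an all-zero gate this reduces to plain averaging of the ten variants; any benefit would have to come from training the gate (and probably the experts) afterwards.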

vessenes 849 days ago

Ooh, that’s a good idea! Although Mixtral seems to have been seeded with identical copies of Mistral, so maybe it doesn’t buy you much? Sounds worth trying though!

SubiculumCode 849 days ago

The deep problem of my life: I'm interested in so many things, but only have time to pursue one hobby and one neuroscience career. If it is indeed a good idea, it's only from connecting gleaned generalizations with other gleaned generalizations; but the devil is often in the details, and I will never have enough time to try it myself. :)

daxfohl 846 days ago

Or a good way to teleport out of local minima while training. Create a few clones and take the one with the steepest gradients.
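
Roughly what that could look like, as a toy sketch only: the clones here are made by adding Gaussian noise to a single scalar parameter (an assumption standing in for variants sampled from the diffusion model), the loss is a made-up 1-D function with several local minima, and `pick_steepest_clone` is a hypothetical helper name.

```python
import math
import random

def loss(w):
    # Toy non-convex loss with several local minima (illustrative only).
    return math.sin(3.0 * w) + 0.1 * w * w

def grad(w, eps=1e-5):
    # Central finite-difference estimate of d(loss)/dw.
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)

def pick_steepest_clone(w, n_clones=8, noise=0.5, rng=None):
    # Perturb the current parameter into several clones and keep the one
    # with the largest gradient magnitude, i.e. the one farthest from flat.
    if rng is None:
        rng = random.Random(0)
    clones = [w + rng.gauss(0.0, noise) for _ in range(n_clones)]
    return max(clones, key=lambda c: abs(grad(c)))

# Usage: jump from the current point, then resume ordinary gradient descent.
w = pick_steepest_clone(-0.5)
for _ in range(100):
    w -= 0.01 * grad(w)
```

Note that the steepest gradient only says the clone is not sitting in a flat spot; it does not guarantee the basin it rolls into is any better, so comparing the loss after a few steps would be a sensible extra check.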

namibj 848 days ago

Hmmm, I could think of using it to update a DDPM with a conditioning input as the dataset expands from an RL/online process, without ruining the conditioning mechanism that's only trainable through the actual RL itself.

I.e., self-supervised training is done to produce semantically sensible results, and the RL-trained conditioning input steers toward contextually useful results.

(Btw, if anyone has tips on how to not wreck the RL training's effort when updating the base model with the recently encountered semantically valid training samples that can be used self-supervised, please tell. I'd hate to throw away the RL effort expended to acquire that much training data for good self-supervised operation. It's already looking fairly expensive...)

daxfohl 847 days ago

You could use this and try to tease out something similar to https://news.ycombinator.com/item?id=39487124, but for NNs instead of images. Maybe it's possible to have this NN diffusion model explain the pieces of the NN it generates and why the parameters have those values.

If we can get that, then maybe we don't even need to train anymore; it'd be possible to start to generate NNs algorithmically.
