by knightoffaith 896 days ago
How can gradient descent work on compute graphs when the space of compute graphs is discrete?
4 comments

> How can gradient descent work on compute graphs when the space of compute graphs is discrete?

You can un-discretize the space of compute graphs by interpolating its points by simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges, whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can move around it in small gradient steps.
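To make the relaxation concrete, here is a minimal Python sketch. Everything in it is an assumption chosen for illustration and is not from the comment above: the toy loss (match a target weighted degree per vertex), the function names, the learning rate, and the 0.5 rounding threshold. Each edge of the complete graph gets a weight in [0, 1], plain gradient descent moves the weights, and rounding at the end recovers a discrete graph.

```python
from itertools import combinations


def weighted_degrees(n, w):
    """Sum of incident edge weights for each of the n vertices."""
    deg = [0.0] * n
    for (i, j), w_ij in w.items():
        deg[i] += w_ij
        deg[j] += w_ij
    return deg


def loss_and_grad(n, w, target):
    """Toy loss: squared gap between weighted degrees and target degrees."""
    deg = weighted_degrees(n, w)
    resid = [deg[v] - target[v] for v in range(n)]
    loss = sum(r * r for r in resid)
    # Edge (i, j) contributes to deg[i] and deg[j], hence the two terms.
    grad = {(i, j): 2.0 * (resid[i] + resid[j]) for (i, j) in w}
    return loss, grad


def relax_and_descend(n, target, lr=0.1, steps=200):
    """Gradient descent over edge weights, clipped to stay in [0, 1]."""
    # Start in the middle of the relaxed space: every edge "half present".
    w = {e: 0.5 for e in combinations(range(n), 2)}
    for _ in range(steps):
        _, grad = loss_and_grad(n, w, target)
        for e in w:
            w[e] = min(1.0, max(0.0, w[e] - lr * grad[e]))
    return w


# Target degrees (1, 2, 1) describe the path graph 0 - 1 - 2.
weights = relax_and_descend(3, [1.0, 2.0, 1.0])
# Round back to a discrete graph: keep edges whose weight exceeds 0.5.
edges = sorted(e for e, w_e in weights.items() if w_e > 0.5)
print(edges)
```

The loss here is convex, so the descent is well behaved; for a real compute graph the loss would be whatever the graph computes, and rounding back to 0/1 edges is not guaranteed to preserve the quality of the relaxed solution.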

Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.

It can’t. There’s no gradient, since the space isn’t nice enough to define one. You can use gradient-free methods, but I’d be shocked if there were an efficient enough way to do that.
I don't know if it can in the traditional sense of back propagation.

I think that Hebbian learning is going to make a comeback at some point, and that it will be used to connect static subgraphs to other subgraphs, which can be trained either separately or on the fly.

Perhaps in a way similar to this paper: https://arxiv.org/abs/1806.09055
I wonder why this hasn't taken off.
From a brief look at the paper, they are doing gradient descent on the architecture based on validation loss, which is good for efficiency, but it's not groundbreaking. The problem is that you are still training towards a target of a correct answer. I don't think this is going to be applicable in the future, in the sense that we will have to train on other things (like logical consistency somehow encoded into the network) as well as correct answers.
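For readers who haven't opened the linked paper, the core trick is a softmax-weighted mixture over candidate operations on each edge, with the mixing logits trained by gradient descent and the strongest operation kept at the end. The Python sketch below is a heavily simplified illustration under my own assumptions: a single edge, three hypothetical parameter-free candidate operations, a toy target function, and one loss instead of the paper's split between training loss for the operation weights and validation loss for the architecture.

```python
import math

# Hypothetical candidate operations for a single edge (no learnable weights).
CANDIDATES = {
    "zero": lambda x: 0.0,
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
}


def softmax(alpha):
    m = max(alpha)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(alpha, x):
    """Continuous relaxation: softmax-weighted sum of all candidates."""
    p = softmax(alpha)
    return sum(p_k * op(x) for p_k, op in zip(p, CANDIDATES.values()))


def grad_alpha(alpha, data):
    """Gradient of the mean squared error with respect to the logits."""
    p = softmax(alpha)
    ops = list(CANDIDATES.values())
    g = [0.0] * len(ops)  # dL/dp_k for each candidate
    for x, y in data:
        r = mixed_op(alpha, x) - y
        for k, op in enumerate(ops):
            g[k] += 2.0 * r * op(x) / len(data)
    # Chain rule through softmax: dL/dalpha_j = p_j * (g_j - sum_k p_k g_k).
    avg = sum(p_k * g_k for p_k, g_k in zip(p, g))
    return [p_k * (g_k - avg) for p_k, g_k in zip(p, g)]


def search(data, lr=0.5, steps=300):
    alpha = [0.0] * len(CANDIDATES)  # start with a uniform mixture
    for _ in range(steps):
        grad = grad_alpha(alpha, data)
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha


# Toy target y = 2x, so the "double" candidate should end up preferred.
data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
alpha = search(data)
# Discretize: keep the candidate with the largest logit.
chosen = max(zip(alpha, CANDIDATES), key=lambda t: t[0])[1]
print(chosen)
```

Note that the objective is still an ordinary supervised loss against a correct answer, which is the point the comment above is making.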
Your expectations are pretty high. Differentiable architecture search as you mentioned in the original comment is one thing; going beyond empirical risk minimization-based learning is another thing entirely. In fact, they seem mostly orthogonal.

That aside, it seems like AI has had the most empirical success by not imposing hard constraints/structure, but letting models learn completely "organically". The computationalists (the folks who have historically been more into this "AI has to have things like logical consistency embedded into its structure" kind of thinking) seem to have basically lost, empirically. Who even knows what Soar[1] is nowadays? Maybe some marriage of the two paradigms will lead to better results, but I doubt that things will head in that direction anytime soon given how massively far just having parallelizable architectures and adding more parameters has gotten us.

[1] https://en.wikipedia.org/wiki/Soar_(cognitive_architecture)

The expectations are high, but it's not so much orthogonal as more basic. Our brains work on add/multiply/activation; this is well known. But the composition of the neural connection strengths in our brain that makes us us is definitely not trained on any sort of final loss. Or at least not completely.
I'm not sure that AI has been successful recently because of its similarities to the human brain. It seems like the project of making human-like AI (in the sense of models that function similarly to the brain) has had a lot less empirical success than the project of trying to minimize loss on a dataset, whatever that takes. Like, look what happened to Hebbian learning, as you mentioned in your other comment. Completely absent from models that are seriously trying to beat SOTA on benchmarks.

Like, it really just seems like LLMs are a really good way of doing statistics rather than the closest model we have of the brain/mind, even if there are some connections we can draw post-hoc between transformers and the human brain.