Must be my ignorance, but every time I see an LLM explainer like the one in the post, it’s hard to believe that AGI is upon us. It just doesn’t feel that “intelligent”, but again, that might just be my ignorance.
It's never going to be AGI, because we're still stuck in the static weights era.
Just because it is theoretically possible to scale your way through sheer brute force alone using a trillion times the compute doesn't mean that you can't come up with a better compute scaling architecture that uses less energy.
It's the same as having a Turing machine with one tape vs. multiple tapes. In theory it changes nothing; in practice, having even the simplest algorithms be quadratic is a huge drag.
The problem with previous AI approaches is that humans wanted to make use of their domain expertise and ended up anthropomorphizing the ML models, which resulted in them being overtaken by people who invested little in domain expertise and more into compute scaling. The quintessential bitter lesson. With the advent of the bitter lesson, people who don't understand anything at all except the concept "bigger is better" arrived, and they think that they can wring out blood from a stone. The problem they run into is that they are trying to get something out of compute scaling that you can't get out of compute scaling.
What they want to do is satisfy a problem definition using an architecture that is designed to solve a completely different problem definition. The AGI compute scaling crowd wants something that is capable of responding and learning through experience, out of something that is inherently designed and punished to not learn through experience. The key aspect "continual learning" does not rely on domain knowledge. It is a compute scaling paradigm, but it's not the same compute scaling paradigm that static weights represent. You can't bet on donkeys in a horse race and expect to win, but since everyone is bringing donkeys to the race it sure looks like you can.
My personal bet is that we will use self-referential matrices and other meta-learning strategies. The days of hand-tuning learning rates to produce pre-baked weights should be over by the end of the decade.
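To make the static vs. continual distinction concrete, here's a toy sketch. It is not the self-referential or meta-learning approach bet on above, just plain online gradient descent on a one-weight model. The `run_stream` helper, the learning rate, and the synthetic stream are all made up for illustration: the same model is run once with frozen weights and once with an update after every example, on data whose underlying relationship flips halfway through.

```python
def run_stream(stream, weight, learning_rate=0.0):
    """Predict y = weight * x over a stream of (x, y) pairs.

    With learning_rate == 0 the weight never changes (static weights).
    With learning_rate > 0 the weight takes one SGD step per example.
    Returns the final weight and the summed squared error.
    """
    total_error = 0.0
    for x, y in stream:
        error = weight * x - y
        total_error += error ** 2
        # Gradient of (weight * x - y)^2 with respect to weight is 2 * error * x.
        weight -= learning_rate * 2 * error * x
    return weight, total_error


# Hypothetical data: the relationship is y = 2x at first, then flips to y = -x.
stream = [(1.0, 2.0)] * 50 + [(1.0, -1.0)] * 50

frozen_w, frozen_err = run_stream(stream, weight=2.0, learning_rate=0.0)
online_w, online_err = run_stream(stream, weight=2.0, learning_rate=0.1)

print("frozen:", frozen_w, frozen_err)
print("online:", online_w, online_err)
```

The frozen copy keeps paying the same error on every example after the flip, while the updating copy drifts toward the new relationship. It says nothing about whether this scales, only what "learning through experience" means mechanically.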
Because LLMs successfully emulate a subset of our brain's functions: memory and imagination (the generative/mixing function). What's missing is our brain's ability to validate the generative output against a model of the environment described by memory and output (the real world), which is built on sensory input. In short, we have a concept of true/false, LLMs don't.
LLMs emulate language by following intricate links between tokens. This is not meant to emulate memory or imagination, just transforming a list of tokens into another list of tokens, generating language. And language is a huge part of the intelligence puzzle so it looks smart to people despite being quite mechanical.
A next step could be to create a mind, with a piece that works similarly to the parietal lobe to give it a sense of self or temporal existence.
> it looks smart to people despite being quite mechanical
Note that brains themselves are also "quite mechanical", as is any physical system or piece of software. "Looks smart", in the limit, reduces to "is smart".
Brains themselves have a lot more mechanisms to cause emergent behavior, what with all the adaptive organic layers, so I can't really compare the two one-to-one.
eh, transformers are universal differentiable layered hash tables. that's incredibly powerful. most logic is just pulling symbols and matching structures with "hash"es.
if intelligence is just reasonable manipulation of logic it's unsurprising that an LLM could be intelligent, what maybe is surprising is that we have ~intelligence without going up a few more orders of magnitude in size, what's possibly more surprising is that training it on the internet got it doing the things it's doing
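One way to read the "hash table" framing: attention is a soft key-value lookup, where a query is compared against every key and the result is a weighted blend of the values rather than a single exact hit. A minimal sketch in plain Python, assuming dot-product similarity and a softmax. `soft_lookup` and its `temperature` parameter are hypothetical names, and real transformer attention adds learned projections, a 1/sqrt(d) scaling, and multiple heads on top of this.

```python
import math


def soft_lookup(query, keys, values, temperature=1.0):
    """Return a softmax-weighted blend of `values`, weighted by query-key similarity."""
    # Similarity of the query to each key (dot product), sharpened by a low temperature.
    scores = [
        sum(q * k for q, k in zip(query, key)) / temperature
        for key in keys
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the value vectors with those weights.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

# A query that matches the first key, with a low temperature, behaves almost like an exact lookup.
print(soft_lookup([1.0, 0.0], keys, values, temperature=0.01))
# A query that matches neither key returns an even blend of both values.
print(soft_lookup([0.0, 0.0], keys, values))
```

Every step is a sum, product, or exponential, so gradients flow through the whole lookup. That is the "differentiable" part, and it is what lets the keys and values be learned instead of hand-written.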
Any arbitrarily complex system must be made of simpler components, recursively down to arbitrary levels of simplicity. If you zoom in enough everything is dumb.
Biological Neuron: Processes information through complex, nonlinear integration of thousands of excitatory and inhibitory inputs across dendritic trees, producing spiking outputs with rich temporal patterns. It adapts dynamically via synaptic plasticity, neuromodulation, and structural changes, operating in a probabilistic, energy-efficient manner within oscillatory networks.
Artificial Neuron: Performs simple, linear summation of weighted inputs, applies a static activation function, and produces a single scalar output. It lacks temporal dynamics, local plasticity, or neuromodulation, operating deterministically with high computational cost and fixed connectivity.
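For reference, the artificial neuron described above fits in a few lines. This is a sketch of the textbook unit, assuming a tanh activation (the choice of activation and the `artificial_neuron` name are just for illustration):

```python
import math


def artificial_neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, then a fixed activation; returns one scalar."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation is static: it does not depend on timing or on past inputs.
    return math.tanh(z)


# Same inputs and weights always give the same output (no temporal dynamics, no plasticity).
print(artificial_neuron([1.0, 2.0], [0.5, -0.25]))
```

Everything the unit "knows" lives in `weights` and `bias`, which only change when an outside training loop changes them.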
"Dendrites can implement non‑linear sub‑units and even logic‑gate‑like behavior before the soma integrates them, whereas the standard artificial neuron uses a plain weighted sum."
"Neurotransmitter diversity (e.g., glutamate, GABA, dopamine) allows different semantics on each connection. An artificial edge conveys only a signed scalar."