I think the rationale for using tricks like score matching and contrastive divergence deserves a mention: the partition function is intractable to compute exactly for most models, because it sums (or integrates) over every possible input.
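To make the cost concrete, here is a minimal sketch in plain Python. The energy function `chain_energy` is a hypothetical toy (an Ising-like chain over bits) that I made up for illustration, not anything from a real EBM; the point is only that the exact sum visits 2**n states.

```python
import itertools
import math

def chain_energy(x, w=0.5):
    # Hypothetical toy energy: neighbouring bits prefer to agree.
    # Bits {0, 1} are mapped to spins {-1, +1}.
    return -w * sum((2 * a - 1) * (2 * b - 1) for a, b in zip(x, x[1:]))

def partition_function(n, w=0.5):
    # Exact Z = sum of exp(-E(x)) over all 2**n binary vectors of length n.
    return sum(
        math.exp(-chain_energy(x, w))
        for x in itertools.product((0, 1), repeat=n)
    )

def num_states(n):
    # Number of terms in the sum above; doubles with every extra bit.
    return 2 ** n
```

Twenty bits is already about a million terms, and an image-sized input is hopeless, which is why training methods that sidestep Z exist in the first place.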
Since we're on the subject, what are EBMs good for today?
You can take many equivalent perspectives on learning systems, but mostly it reduces to "messing with denominators in Bayes' rule". This is no different.
EBMs aren't widely used today because you first have to fit the joint model, then fix some of the inputs, then fit the remaining inputs in a second optimization step. That's just too much compute for today's workloads compared to feedforward NNs.
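A rough sketch of that second step, assuming the joint model is already fitted. `joint_energy` here is a hypothetical stand-in (a hand-written quadratic, not a trained network), and `infer_y` is just plain gradient descent on the free variable with the other one clamped.

```python
def joint_energy(x, y):
    # Hypothetical "already trained" energy: low when y is close to 2 * x.
    return (y - 2.0 * x) ** 2

def grad_wrt_y(x, y):
    # Analytic derivative of joint_energy with respect to y.
    return 2.0 * (y - 2.0 * x)

def infer_y(x, steps=200, lr=0.1, y0=0.0):
    # Clamp x, then run an inner optimization loop over y.
    y = y0
    for _ in range(steps):
        y -= lr * grad_wrt_y(x, y)
    return y
```

A feedforward net would produce its answer in one pass; here every prediction costs `steps` gradient evaluations, which is the overhead being complained about.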
- Simplicity and Stability: An EBM is the only object that needs to be trained and designed. Separate networks are not tuned to ensure balance.
- Sharing of Statistical Strength: Since the EBM is the only trained object, it requires fewer model parameters than approaches that use multiple networks.
- Adaptive Computation Time: Implicit sample generation is an iterative stochastic optimization process, which allows for a trade-off between generation quality and computation time.
- VAEs and flow-based models are bound by the manifold structure of the prior distribution and consequently have issues modelling discontinuous data manifolds, often assigning probability mass to areas unwarranted by the data. EBMs avoid this issue by directly modelling particular regions as high or lower energy.
- Compositionality: If we think of energy functions as costs for certain goals or constraints, summation of two or more energies corresponds to satisfying all their goals or constraints.
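The adaptive-computation and compositionality points in the list above can be sketched together with unadjusted Langevin dynamics in NumPy. Everything here is an illustrative assumption: the two quadratic energies, the step size, and the function names are mine, not from the quoted list.

```python
import numpy as np

def grad_e1(x):
    # Gradient of the hypothetical energy E1(x) = 0.5 * (x - 1)^2.
    return x - 1.0

def grad_e2(x):
    # Gradient of the hypothetical energy E2(x) = 0.5 * (x - 3)^2.
    return x - 3.0

def langevin_sample(grad_fns, steps, step_size=0.05, n_chains=2000, seed=0):
    # Composition: the gradient of a sum of energies is the sum of gradients.
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 5.0, size=n_chains)  # broad, uninformed start
    for _ in range(steps):
        grad = sum(g(x) for g in grad_fns)
        noise = rng.normal(size=n_chains)
        # Unadjusted Langevin update: gradient step plus Gaussian noise.
        x = x - step_size * grad + np.sqrt(2.0 * step_size) * noise
    return x

# More steps buys better samples; fewer steps is cheaper but cruder.
rough = langevin_sample([grad_e1, grad_e2], steps=2)
refined = langevin_sample([grad_e1, grad_e2], steps=200)
```

Since E1 + E2 equals (x - 2)^2 plus a constant, the refined samples should concentrate around 2, the point that best satisfies both constraints, with variance close to 0.5 (slightly above, because of discretization bias in the unadjusted update).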
As far as I can tell, flow-based models are bound by the exact same requirements as energy-based models (flow = diffusion/normalizing flow/flow-matching models). But they're absolutely right about VAEs. Those are a memetic virus that needs to die off in favor of more theoretically grounded encoders.