Thank you for your comment, and I sincerely apologize for my slow response! "Rediscovering structure" is exactly the inefficiency I was trying to highlight.

In the physics/science cases I work with, the factorization is usually between the physical law (shared structure) and the experimental conditions (dataset-specific structure). If you don't separate them, the model wastes capacity trying to memorize the noise of the experimental conditions. (It's ineffective as well as wasteful.)
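As a toy illustration of that factorization (a hypothetical setup, not a real experiment): suppose three runs measure free fall, y = 0.5 * g * t^2 + offset_k, where g is the shared law and offset_k is a per-run calibration bias. The names, numbers, and the free-fall model are all assumptions made up for this sketch. Fitting one shared slope plus one offset per run keeps the nuisance terms out of the estimate of g:

```python
import random

random.seed(0)

# Hypothetical ground truth: one shared constant, one bias per experiment.
G_TRUE = 9.81
OFFSETS_TRUE = [0.3, -0.5, 1.2]
times = [0.1 + 0.1 * i for i in range(20)]


def simulate(offset):
    """Return (x, y) pairs with x = 0.5 * t^2 and y = g * x + offset + noise."""
    return [
        (0.5 * t**2, 0.5 * G_TRUE * t**2 + offset + random.gauss(0.0, 0.05))
        for t in times
    ]


def mean(values):
    return sum(values) / len(values)


def fit_factorized(datasets):
    """Least-squares fit of y = g * x + offset_k with g shared across datasets.

    Centering each dataset on its own mean removes offset_k, so the slope is
    estimated from within-dataset variation only.
    """
    num = den = 0.0
    means = []
    for data in datasets:
        x_bar = mean([x for x, _ in data])
        y_bar = mean([y for _, y in data])
        means.append((x_bar, y_bar))
        num += sum((x - x_bar) * (y - y_bar) for x, y in data)
        den += sum((x - x_bar) ** 2 for x, _ in data)
    g_hat = num / den
    # Each offset is whatever the shared law leaves unexplained in that run.
    offsets_hat = [y_bar - g_hat * x_bar for x_bar, y_bar in means]
    return g_hat, offsets_hat


datasets = [simulate(offset) for offset in OFFSETS_TRUE]
g_hat, offsets_hat = fit_factorized(datasets)
print("shared g:", round(g_hat, 3))
print("per-run offsets:", [round(o, 3) for o in offsets_hat])
```

The shared parameter sees all the data, while each offset only has to explain its own run.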

The analogy to code generation makes a lot of sense: flattening a tree into a sequence forces the model to infer syntax that was already explicit. Thank you for the link; I look forward to diving into it!