| HN Mirror

So, technically, and beyond that, theoretically, and even anatomically.

In the brain's visual processing, the left hemisphere was found to hold details and the right hemisphere to hold structural relations. So a whole is composed of elements and their relative positions.

In Convolutional Neural Networks, the "near, direct" layers (those close to the input) contain analytic detail and the "far, abstract" layers (the deeper ones) contain synthetic shapes.
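One rough way to see why depth goes with abstraction is to track how much of the input a single unit can "see" at each layer. The sketch below uses the standard receptive-field recurrence; the layer stack (3x3 convolutions alternating with 2x2 stride-2 pooling) is an assumed toy example, not any particular network.

```python
def receptive_fields(layers):
    """Return the receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # one input pixel; distance between adjacent units is 1
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap widens the view
        jump *= stride             # striding spreads units further apart
        sizes.append(rf)
    return sizes

# Assumed toy stack: conv3x3, pool2x2, repeated three times.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_fields(layers))  # [3, 4, 8, 10, 18, 22]
```

Early units only cover a few pixels (edges, texture), while later units cover a region large enough to hold a shape.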

So, implementation-wise, you can take e.g. descriptions as the abstract part and a "pre-acquired" memory of details as the «graphic» part.

Edit:

About the "combination": well, that is the whole purpose of this new technology proposal, "ControlNet". I.e., formerly you may have had some "transformer" from input to output, and now "conditional controls" are added (through a "zero-convolution" technique). See "Adding Conditional Control to Text-to-Image Diffusion Models": https://arxiv.org/abs/2302.05543
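To make the "zero-convolution" idea a bit more concrete, here is a minimal NumPy sketch of how I read the paper: a locked block is left untouched, a trainable copy receives the input plus the condition, and both connections go through 1x1 convolutions whose weights and biases start at zero. The functions `frozen_block` and `trainable_copy` are hypothetical stand-ins (a plain `tanh`), and the "trained" weights are made-up values for illustration only.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution over a (channels, height, width) feature map."""
    # Mixes channels at every pixel independently.
    return np.einsum("oi,ihw->ohw", weight, x) + bias[:, None, None]

def frozen_block(x):
    """Hypothetical stand-in for a locked, pretrained block."""
    return np.tanh(x)

def trainable_copy(x):
    """Hypothetical stand-in for the trainable copy of that block."""
    return np.tanh(x)

def controlled_block(x, c, zero_in, zero_out):
    """y_c = F(x) + Z_out(F_copy(x + Z_in(c))), with Z_* being 1x1 convs."""
    hint = conv1x1(c, *zero_in)          # condition enters through a zero conv
    branch = trainable_copy(x + hint)    # trainable copy sees input + condition
    return frozen_block(x) + conv1x1(branch, *zero_out)

channels = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(channels, 8, 8))  # feature map entering the block
c = rng.normal(size=(channels, 8, 8))  # conditioning map (e.g. encoded edges)

# At initialisation both zero convolutions have all-zero weights and biases,
# so the control branch contributes nothing.
zero = (np.zeros((channels, channels)), np.zeros(channels))
y_init = controlled_block(x, c, zero, zero)

# Made-up non-zero weights, standing in for what training might produce.
trained = (0.1 * np.eye(channels), np.zeros(channels))
y_trained = controlled_block(x, c, trained, trained)
```

With zero weights, `y_init` equals `frozen_block(x)` exactly, so adding the control branch does not disturb the original model at the start; the condition only gains influence as the zero convolutions move away from zero.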