So, technically, and beyond that, theoretically, and even anatomically.
In the brain's visual processing, the left hemisphere was found to hold details and the right hemisphere to hold structural relations. So a whole is composed of elements and their relative positions.
In Convolutional Neural Networks, the "near, direct" layers (those close to the input) contain analytic detail and the "far, abstract" layers (the deeper ones) contain synthetic shapes.
So, implementation-wise, you can take, e.g., descriptions as the abstract part and a "pre-acquired" memory of details as the «graphic» part.
Edit:
About the "combination": well, that is the whole purpose of this new technology proposal,
"ControlNet"
- i.e., formerly you may have had some "transformer" from input to output, and now "conditional controls" are added (through a "zero-convolution" technique) - see Adding Conditional Control to Text-to-Image Diffusion Models - https://arxiv.org/abs/2302.05543
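To make the "zero-convolution" idea concrete, here is a minimal pure-Python sketch, not the paper's implementation. It assumes a toy setup: `frozen_block` and `trainable_copy` are hypothetical stand-ins for a pretrained block and its trainable clone, and a feature map is just a list of pixels, each a list of channel values. The point it shows is that a 1x1 convolution whose weights and bias start at zero adds nothing at first, so the controlled model initially behaves exactly like the original one.

```python
def conv1x1(pixels, weight, bias):
    """1x1 convolution: the same channel-mixing linear map applied at every pixel."""
    return [
        [sum(w * v for w, v in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]

def zero_conv_params(channels):
    """Weight matrix and bias of a 1x1 convolution, both initialised to zero."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels

def frozen_block(pixels):
    """Hypothetical stand-in for the pretrained, locked block."""
    return [[2.0 * v + 1.0 for v in px] for px in pixels]

def trainable_copy(pixels):
    """Hypothetical stand-in for the trainable clone (starts as the same function)."""
    return frozen_block(pixels)

def add(a, b):
    """Element-wise sum of two feature maps of the same shape."""
    return [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(a, b)]

def controlled_block(x, condition, z_in, z_out):
    """y_c = F(x) + Z_out(F_copy(x + Z_in(c))), with Z_in and Z_out as 1x1 convs."""
    injected = add(x, conv1x1(condition, *z_in))
    control = conv1x1(trainable_copy(injected), *z_out)
    return add(frozen_block(x), control)

x = [[0.5, -1.0], [2.0, 0.0]]          # 2 pixels, 2 channels
condition = [[1.0, 1.0], [0.0, 3.0]]   # e.g. an encoded edge map of the same shape

z_in, z_out = zero_conv_params(2), zero_conv_params(2)
y_plain = frozen_block(x)
y_init = controlled_block(x, condition, z_in, z_out)
print(y_init == y_plain)  # True: at initialisation the control branch adds nothing

# Pretend training has moved the output zero-conv away from zero (made-up values).
z_out_trained = ([[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
y_trained = controlled_block(x, condition, z_in, z_out_trained)
print(y_trained != y_plain)  # True: the control branch now changes the output
```

The design choice this illustrates is that starting from zero means the added control path cannot disturb the pretrained behaviour at the start of fine-tuning; it only begins to contribute as its weights move away from zero.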
Is the graphic used as a graphic?