The difficulty of prompt engineering should not be underestimated. You have to try lots of variations and iterate heaps on the prompt. I usually generate several hundred images to pick from and evaluate them on connection to the topic, coherence, and aesthetics.
For many generated images, people don't have a specific target in mind and let themselves be surprised (which is fun!), but it's quite difficult to take a given topic, write a prompt and get back a coherent image that is on topic.
Is there anything out there on “what works” and what doesn’t, what those challenges look like in iteration, etc.? It sounds like an interesting skill, frankly.
I did something along these lines, but a hacky real-time version. I had a computer that sat there with the Windows speech-to-text API constantly running. Once the transcript reached some word count or time span, it would do a Google meme GIF image search and show the top four hits. It was pretty amusing when it got things completely wrong.
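For anyone who wants to try this, here is a rough sketch of just the "word count or time span" trigger described above. It is not the original code: the class name, the thresholds, and the `top_hits` helper are all made up for illustration, and the speech recognizer and image search are left out entirely (you would feed in whatever text your recognizer produces).

```python
import time
from typing import Callable, List, Optional


class TranscriptBuffer:
    """Collects transcribed words and releases them as one search query.

    Hypothetical helper: the thresholds below are arbitrary guesses.
    """

    def __init__(self, max_words: int = 12, max_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_words = max_words
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the timing can be tested
        self.words: List[str] = []
        self.started_at: Optional[float] = None

    def feed(self, text: str) -> Optional[str]:
        """Add transcribed text; return a query once a threshold is hit.

        The time span is only checked when new text arrives, so a long
        silence does not flush the buffer by itself.
        """
        new_words = text.split()
        if not new_words:
            return None
        if self.started_at is None:
            self.started_at = self.clock()
        self.words.extend(new_words)

        too_many = len(self.words) >= self.max_words
        too_old = self.clock() - self.started_at >= self.max_seconds
        if too_many or too_old:
            query = " ".join(self.words)
            self.words = []
            self.started_at = None
            return query
        return None


def top_hits(results: List[str], n: int = 4) -> List[str]:
    """Keep the first n results, mirroring the 'top four hits' display."""
    return results[:n]
```

Each returned query would go to whatever image search you use, and `top_hits` trims the result list before display.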
Something similar would make an interesting art installation.
Another fun art installation would be a camera system that tries to recognize people, logos, and items in real time and sends the resulting string of recognized labels to DALL-E.
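The step of turning detections into a prompt could look something like the sketch below. It assumes the recognizer already gives you a list of text labels; the function name and the item cap are invented here, and the actual camera, recognition model, and image-generation call are not shown.

```python
from typing import Iterable, List


def labels_to_prompt(labels: Iterable[str], max_items: int = 6) -> str:
    """Join detected labels into one comma-separated prompt string.

    Hypothetical helper: drops blanks and case-insensitive repeats,
    keeps first-seen order, and caps the number of items.
    """
    seen = set()
    unique: List[str] = []
    for label in labels:
        cleaned = label.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return ", ".join(unique[:max_items])
```

Deduplicating matters because a real-time detector tends to report the same object on every frame.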
This is really great. Lots of the images were really striking! I'd love to see something like this scaled up, maybe pulling headlines from multiple sources to generate different images for the same story?