I think primarily this victimizes all those already victimized by the CSAM in the training material, and it also generally offends our society's collective sense of morality.
Simplistically and ignorantly speaking, if a diffusion model knows what a child looks like and also knows what an adult woman in a bikini looks like, couldn't it just merge the two to create a child in a bikini? It seems to do that with other things (e.g. a pelican riding a bicycle).
In principle yes, but in practice no: the models don't just learn the abstract space, but also memorise individual people's likenesses. The "child" concept contains little clusters for each actual child who appeared enough times in the dataset. If you tried to do this, the model would produce sexualised imagery of those specific children with distressing regularity.
There are ways to select a specific point or region in latent space for a diffusion model to work towards. If properly chosen, this can have it avoid specific people's likenesses, and even generate likenesses outside the domain of the latent space (which tend to have severe artefacts). However, text prompting doesn't do that, even if the prompt explicitly instructs it to: text-to-image prompts aren't instructions. A system like Grok will always exhibit the behaviour I described in my previous (GP) comment.
As I mentioned in another comment (https://news.ycombinator.com/item?id=46503866), there are other reasons not to produce synthetic sexualised imagery of children, which I'm not qualified to talk about, and I feel this topic is too sensitive for my usual disclaimered, uninformed pontificating.