by tavavex 1037 days ago
Very exciting. The "first-generation" Stable Diffusion frontends seem to have settled on a specific design philosophy, so it's interesting to see new tools (like this or ComfyUI) shake up the way people work with this tool. I hope that in a few years, we'll know which philosophy works best.

Out of all the AI-related tools, generative art frontends are probably the thing most likely to radically change and improve in the next few years.

It's specifically why I've avoided diving too deep into "prompt engineering", because the kind of incantations required today just aren't going to be the way most people interact with this stuff for very long.

> Out of all the AI-related tools, generative art frontends are probably the thing most likely to radically change and improve in the next few years.

The difference between UIs is actually not very relevant today; by now the generic workflow for complex scenes is more or less obvious to anyone who spent time with SD.

- Draw basic composition guides. Use them with controlnets or any other generic guidance method to enforce the environment composition you want. Train your own controlnet if you need something specific. (lots of untapped potential here)

- Finetune the checkpoint on your reference pictures or use other style transfer methods to enforce a consistent style.

- Use manual brush masking, manually guided segmentation (e.g. SAM), or prompted segmentation (e.g. CLIPSeg) to select the parts to be replaced with other objects. The choice depends on your case and on whether you need to do it procedurally.

- Photobash and add detail to the elements of your scene using any composition methods you have (noisy latent composition, inpainting, etc.) with the masks you created in the previous step. Use advanced guidance (controlnets, t2i adapters, etc.).

- Don't bother with any prompts beyond very basic descriptions, as "prompt engineering" is slow and unreliable. Don't overwhelm the model by trying to fit lots of detail in one pass; use separate passes for separate objects or regions.

- Alternative 3D version: build a primitive 3D scene from basic props (shapes, rigs). Render the backdrop and separate objects into separate layers as guides. Use them with controlnets & co to render the scene in a guided manner, combining the objects by latent composition, inpainting, or any other means. This can be used for procedural scenes and animation (although current models lack temporal stability).
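
To make the "separate passes with masks" idea concrete, here is a minimal sketch of mask-guided compositing. It works on plain pixel grids rather than latents, and `composite` and `run_passes` are made-up names for illustration, not anything from SD or a particular frontend. A real pipeline would generate each patch with inpainting or latent composition; here the patches are just constant grids.

```python
# Illustrative only: a pixel-space stand-in for mask-guided compositing.
# Real SD frontends blend latents or inpaint; these helpers are hypothetical.

def composite(base, patch, mask):
    """Blend `patch` into `base` where `mask` is 1.0; keep `base` where it is 0.0."""
    if not (len(base) == len(patch) == len(mask)):
        raise ValueError("base, patch and mask must have the same height")
    out = []
    for base_row, patch_row, mask_row in zip(base, patch, mask):
        if not (len(base_row) == len(patch_row) == len(mask_row)):
            raise ValueError("base, patch and mask must have the same width")
        out.append([b * (1.0 - m) + p * m
                    for b, p, m in zip(base_row, patch_row, mask_row)])
    return out


def run_passes(base, passes):
    """Apply one (patch, mask) pass per object or region, in order."""
    image = base
    for patch, mask in passes:
        image = composite(image, patch, mask)
    return image


# A 2x4 grayscale "scene": black backdrop, one object on the left, one on the right.
backdrop = [[0.0] * 4 for _ in range(2)]
tree = [[1.0] * 4 for _ in range(2)]
tree_mask = [[1.0, 0.0, 0.0, 0.0] for _ in range(2)]
house = [[0.5] * 4 for _ in range(2)]
house_mask = [[0.0, 0.0, 1.0, 1.0] for _ in range(2)]

result = run_passes(backdrop, [(tree, tree_mask), (house, house_mask)])
```

Each pass only touches its own masked region, which is the point of the workflow: the model never has to fit the whole scene into one prompt.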

As long as your tool has all that in one place, it's a breeze, regardless of the UI paradigm (admittedly auto1111's overloaded gradio looks straight out of a trash compactor nowadays). I expect 2D/3D software integrations to be the most successful in the future, as they already offer proven UIs and the most desirable side features. The problem is that in its current state SD can't do much in a production setting. It's not a finished product, so there's not a lot of interest in software integrations just yet.

Thanks for sharing this detailed guide. Can you share an example of the type of resulting image you’ve generated using the above approach?

I’ve only used DALL-E or SD with basic prompts, sometimes with Photoshop afterward. I’m curious what you’ve been able to come up with using your more complex pipeline.

vizcom.ai ;)
Wow that is awesome... I'd kill my $30/mo sub to Midjourney if this thing were $30/mo for individuals...
As a commercial artist who has worked in several professional creative industries, I find the current textual methods of interacting with generative image AI to be unusable for the vast majority of professional tasks. I think they're great for a lot of laypeople because they abstract away things that laypeople don't want to have to think about, but in professional workflows, you need specificity at pixel-level granularity, predictability, and repeatability. Those things are all difficult with purpose-built tools and impossible through text prompts. I haven't spoken to a single colleague outside the high-volume, low-effort end of their disciplines/fields who disagrees. Most commercial artists' selling point is deciding exactly what should go into a piece; implementing it is the easy part.

The pro tools that have incorporated generative AI into their workflows are not at all textual. The environment that popularizes this among the general public will look a lot more like Canva or maybe Instagram than what's popular now.

At some level I agree that the prompt engineering done today to break ChatGPT guard rails barely rises to “interesting hack” level, but I think that manipulating language to induce specific behavior by an LLM is a powerful skill, and requires a very facile understanding of language in the semantic context of the training corpus. By varying the tone, vocabulary, style, pacing, and obviously the semantics of the original inducing language you can dramatically change the behavior of the LLM. This is less about prompt engineering and more about being a masterful manipulator of language, which is why I don’t fear that LLMs make language skill irrelevant. Those with the most language skill will produce the most compelling and tailored LLM output for a purpose.
It’s entirely likely that there’s much more effort going into generative text. Any perceived advancement of generative images is going to be disproportionately skewed due to the richness of information that they hold.
Incantations are fun!
Photoshop Beta does it best. The generative features are just new tools that work as you’d expect with all the existing tools. For example, if you want to do outpainting, just make your canvas bigger and you get a contextual menu where you can (optionally) type a prompt. Inpainting, just make a selection however you want and type a prompt.
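
For anyone wondering what "make your canvas bigger" amounts to under the hood, one way to picture it is as an inpainting mask covering only the new area. This is a guess at the general idea, not Photoshop's actual implementation; `outpaint_mask` and its parameters are hypothetical.

```python
# Hypothetical helper: builds the mask an outpainting step might use.
# 1 marks pixels to generate, 0 marks original pixels to keep.

def outpaint_mask(old_w, old_h, new_w, new_h, offset_x=0, offset_y=0):
    """Return `new_h` rows of `new_w` values for an enlarged canvas.

    The original image is assumed to sit at (offset_x, offset_y) on the new canvas.
    """
    if (offset_x < 0 or offset_y < 0
            or offset_x + old_w > new_w or offset_y + old_h > new_h):
        raise ValueError("original image must fit inside the enlarged canvas")
    return [
        [0 if (offset_x <= x < offset_x + old_w
               and offset_y <= y < offset_y + old_h) else 1
         for x in range(new_w)]
        for y in range(new_h)
    ]


# A 2x2 image placed one pixel from the left on a 4x3 canvas.
mask = outpaint_mask(2, 2, 4, 3, offset_x=1, offset_y=0)
```

The optional prompt then only has to describe what goes in the masked area, which is why it can stay short.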
The control that offers is extremely limited versus SD in A1111 with all its different models, LoRAs, embeddings, extensions and ControlNet types.
I wrote a TypeScript API generator for ComfyUI, works great - hopefully will have time to release it soon.

I think there's so much unexplored potential in UI and workflows around generative AI, we've barely scratched the surface. Very exciting times ahead!

I bet this will be available as an Automatic1111 extension by end of month.
I'm doubtful about that. A1111 is what I called a "first-generation frontend". Both it and all of its extensions follow a specific model for its usage - in general, every tool is contained on its own tab, with each tab having buttons to transfer the outputs into other tools. Radically changing this model would require rewriting so much that it'd just make sense to use a different frontend in the first place.
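
To illustrate the contrast, here is a rough sketch of the node-graph model that the newer frontends lean toward, where outputs flow between steps through declared connections instead of "send to tab" buttons. It is a toy evaluator with made-up node names and string placeholders for images, not ComfyUI's actual workflow format or API.

```python
# Hypothetical node-graph model; not ComfyUI's actual workflow format or API.

def run_graph(nodes, target):
    """Evaluate `target`, computing each upstream node at most once.

    `nodes` maps a node name to (function, [names of input nodes]).
    """
    cache = {}

    def evaluate(name, visiting):
        if name in cache:
            return cache[name]
        if name in visiting:
            raise ValueError(f"cycle detected at node {name!r}")
        func, inputs = nodes[name]
        args = [evaluate(dep, visiting | {name}) for dep in inputs]
        cache[name] = func(*args)
        return cache[name]

    return evaluate(target, frozenset())


calls = []  # records how often the expensive first step actually runs


def generate():
    calls.append("generate")
    return "image"


def upscale(img):
    return f"upscaled({img})"


def make_mask(img):
    return f"mask({img})"


def inpaint(img, mask):
    return f"inpaint({img}, {mask})"


graph = {
    "generate": (generate, []),
    "upscale": (upscale, ["generate"]),
    "mask": (make_mask, ["generate"]),
    "inpaint": (inpaint, ["upscale", "mask"]),
}

result = run_graph(graph, "inpaint")
```

Here the generated image feeds two downstream steps without anyone manually transferring it, which is the kind of thing a tab-per-tool layout makes awkward.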