Hacker News new | ask | show | jobs
by Magi604 968 days ago
SD outputs have an "uncanny valley" type of quality to them. You just KNOW when an image is from SD. And I have looked at getting started with SD, but the requirements and setup and +-prompting "language" just kind of turned me off the whole thing.

Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

I guess my point is to ask whether SD is worth bothering with at this time when DALL E and Imagen and possibly others are just on the brink of becoming mainstream and just going to get better and better. Clunking together something with SD seems unnecessary when you can generate more results, better results, in a faster way, with less requirements, and without the steep learning curve, by using other methods.

12 comments

One major benefit and the reason why I use the StableDiffusion tools and models is because I can run them at home on my relatively old NVIDIA 2080 GPU with 8GB of VRAM. Costs me nothing (besides electricity).

Depends if you value this kind of freedom in life.

You can do some things such as colorizing black and white images with the Recolor model.

https://huggingface.co/stabilityai/control-lora

I have to agree at how convenient and (long term) inexpensive this can be. I may not always get the greatest results right away, but it is fun to come up with some ideas, put them into a prompt iterator (or matrix), and run it overnight. I can tweak it to my heart's content.
Very interesting - thank you for sharing this. Would love to explore this as a team and perhaps put out a blog on helping others get started with control-lora
I mean, I'm running DALLE 3 on a browser from an old laptop and I've generated probably over 15k images in 2 weeks, spanning the gamut from memes to art to lewds (with jailbreaks). The ability to completely scrap what you're building and start totally fresh at the drop of a hat with a new line of ideas and get instant results seems pretty freeing to me.
That’s fine, but it’s like asking: “Why would anyone want to have a personal website when you can just write stuff on Facebook and Twitter and it’s so much easier?”

Stable Diffusion is an open model that you can run locally on your own computer without anyone’s permission. Dall-E is a closed model that runs on OpenAI’s very expensive server farm, and they can change how it works and what it costs whenever they please.

Right now AI is in the Uber-style expansion phase where the service is practically given away to conquer market share. Once the hypergrowth is over, OpenAI will start raising their prices just like Uber did.

With SD I can generate at least 15k images daily on my old laptop, I can train it with new styles, characters, real people, etc.; download thousands of new styles, characters, real people, etc. from Civitai, and best of all, never worry about ever losing access to it, being censored, having to jailbreak it, being snooped on, etc.

Plus a million other tools that the community has made for it, like ControlNet or things like AnimateDiff to create videos. I can also easily create all kinds of scripts and workflows.

I'm using Dall-e 3 through ChatGPT but it seems to limit the amount of images I can generate per half hour. I haven't figured out the actual limit but sometimes I go to generate an image and it just says "You've reached your image generation cap please wait _n_ minutes before trying again"

Are you getting around that somehow? Even if it'll let me generate 36 images per half hour (which seems like it's probably lower than that) I can only generate 6k in 2 weeks prompting 24/7. I'm not scrutinizing your numbers I'm more hoping I'm missing some way to not have to be capped. I already pay for GPT+

When you run out of boost tokens, if you clear your bing search history and restart your browser you get fresh boost tokens. I've been able to do this endlessly. Also if you use it during non-peak hours the wait times are usually 30 seconds or under for 3-4 image generations.
Sweeeet! Thank you!
You just KNOW when an image is from SD

No, you know when a beginner generated an image in Stable Diffusion. With enough skill and attention, you will not.

Sure, there is a learning curve and it takes more time to get to a good result. But in turn, it gives you control far beyond what the competition can offer.

I’m assuming you haven’t used SDXL?

Give it a go with invokeAI - you can create images that I guarantee you wouldn’t know were generated. Like anything (photography included) it’s a skill.

Examples:

  - https://civitai.com/images/2862100
  - https://civitai.com/images/2339666
  - https://civitai.com/images/2846876
I can see at least three finger issues with the couple in the cinema.

More than that though: I use SDXL quite a bit for fun, and while I like it and it can be very good, it's still prone to getting stuck in a David Cronenberg mode for reasons I can't solve.

Oh yeah it’s not one-shot perfect but it gets you 90% of the way there for a lot of things. I’m super impressed with it.
At glance I get uncanny valley from two. After looking closer it's likely because with the photo of the couple at the cinema, the woman's arm around him is wearing the wrong clothes. Then photo of the guy with a hat, his neck piece is asymmetric.
That's still a 1 in 3 success rate, at the cost of writing a prompt and waiting a minute.
The eyes are messed up in the second one too which instantly gives it away
I can run Stable Diffusion on my local machine. It is open source and weights are public, giving me in theory access to anything I want to modify.

I cant change anything on DALL E, I can just take the input or change the prompt.

Also it is a centralized service that can be shut down, modified, censored or become very expensive at any time.

Try SDXL. Find a good negative prompt, then just put a short sentence (starting with the kind of image, such as photograph, render, etc.) describing what you want in the positive prompt. It is much simpler and has fantastic results. Tweak to your hearts desire from there.

If you see a part of the scene that looks weird (and you know what it should be) add it to your prompt. For example, if you want "photo of a jungle in South America", and the foliage looks weird, add something like "with lush trees and ferns".

Try: https://github.com/lllyasviel/Fooocus

I also recommend a good photorealistic base model, like RealVis XL.

In my experience its like DALL E but straight up better, more customizable, and local. And thats before you start trying finetunes and LORAs.

Other UIs will do SDXL, but every one I tried is terrible without all those default fooocus augmentations.

SDXL is great but it's in no way better than DALL E as far as straight text-to-image goes apart from the lack of censorship.

It has plenty of other advantages, but you can't tell it "make me a cute illustration of a 2 year old girl with Blaze from Blaze and the Monster Machines on a birthday cake with a large 2 candle on it."

DALL E will nail that, more or less. SDXL very much won't.

Here's what I got, pasting your prompt in DALL-E 3:

- https://ibb.co/k0NCWG7 - https://ibb.co/Vm3GZcR - https://ibb.co/bvSC4w3 - https://ibb.co/VqSdYbZ

I'm surprised that it didn't complain about copyrighted characters, it tends to do that a lot for me.

I used that as an example as I recently asked for it. I did find I had to tell it that "monster" in the title referred to monster trucks, not actual monsters. That helped it not put actual monsters in (as yours are half Blaze/half monsters), though my generations were way better at doing Blaze than yours were - they just had cute little monsters around too.
SD XL understands prompts much better than 1.5. So the next version of SD might be comparable to Dall-E without censorship.
Heh, yeah, that is true

https://ibb.co/1m0bLWC

More cherrypicking and messing with styles is getting closer, but nothing like Dall-E's first try I'm sure.

> You just KNOW when an image is from SD.

You don't. People think they do, but they don't.

DALL-E within ChatGPT uses GPT-4 to rewrite what you ask for into a good text-to-image prompt. You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.
Somewhat, but dalle3 is hugely better at understanding a description and relationships.
LLMs in general are, and that can be leveraged by using an LLM to set up layout for Stable Diffusion.

https://github.com/TonyLianLong/LLM-groundedDiffusion

> You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.

And, indeed, someone has:

https://github.com/sayakpaul/caption-upsampling

Depends what you want.

Dalle 3 is super good, but lacks the creative control controlnets and ip-adapter provide. So for instance afaik there is no way to perform style transfers, or ’paint a van gogh portrait over my pencil sketch’.

Both are good currently but at different things.

”Prompt engineering” is and will be total bs. Dalle3/chatgpt provides the actual workflow we want where we describe to the intelligent agent (chatpgt) what we want and it worries over the accidental-complexity-intricasies of the clip model itself.

Dall-E has the same problems as other models. Try generating a clockwork mechanism with it, for example.

SD is worth bothering with because it's open, you can run and extend it yourself.

You know when they're bad enough that you know.
That's funny to hear because DALL-E 3 mainly improves prompt understanding, it hallucinates like mad with faces and hands, and doesn't seem to do anything to improve them like Midjourney for example.

>Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

Hyper-realistic, but is it what you want from it? Are you able to guide it into doing exactly what you want? If you have such requirements that just a natural language prompt is enough and is somehow faster than sketching and providing references, of course use it. I'm not so lucky, I don't get what I want from it, and no amount of prompt understanding will make it easier. Although SD/SDXL doesn't pass the quality bar either, not because it's not "detailed" or "hyper-realistic" enough, but because it doesn't pay attention to the things that should be prioritized, like linework or lighting. Neither does any other model. Controlnets and LoRAs alone aren't sufficient for controllability either, mostly because it's too small to understand high-level concepts. So I don't use anything.