by NationOfJoe 1398 days ago
i only have a passing curiosity in these projects personally.

can someone in the field explain why this has exploded recently? there seems to be a lot of these tools released recently (text to image) was there a major breakthrough? a new idea that pushed everyone forward? a recent sharing of talent between groups?

edit: just another thought, are they just being posted to HN now? i don't see a date on the page for when it was released. I also don't know the general term to search for to find a list of all of these and their release dates


The previous models were either 1. Limited in their capacity to create something that looked very cool, or 2. Gigantic models that needed clusters of GPUs and lots of infrastructure to generate a single image.

One major thing that happened recently (2ish weeks ago) was the release of a model (with weights) called Stable Diffusion, which runs on consumer grade hardware and requires about 8GB of GPU RAM to generate something that looks cool. This has opened up usage of these models for a lot of people.

example outputs with prompts for the curious: https://lexica.art/

Is Lexica finding results previously computed? Or generating them? I could only work with very simple queries like "photo of a cat".
It's just a database of submitted works I think. You can try scrolling down on the opening page to see random prompts and outputs.
It's ~1.5 million entries inputted by users during the beta period on Discord.
There are a lot of prompts and results that aren't being included. Not sure what the criteria were.
The first major projects were OpenAI's DALL-E, then DALL-E 2 a year later. DALL-E 2 was much much better. Since then a few new projects have been released in rapid succession, including the open source Stable Diffusion.

Here are some of the projects on GitHub: https://github.com/topics/text-to-image

Another good source is https://paperswithcode.com/task/text-to-image-generation

Like anyone deeply in a field I know maybe several thousand people who could probably give a better answer, but I figure I'll give an effort to provide one since I don't see any good ones posted yet.

The moment everyone knew this was going to be big was in 2019, when StyleGAN came out. They used a lot of tricks, like aligning face features (such as eyes), and had all their pictures come from a single domain (the most famous being faces), but nonetheless that was the moment everyone in the AI field knew this was going to be big, and so three years ago a lot of big people shifted to this line of research.

The four main innovations since then have been:

1. Transformers

Generalized computation kernels which let a model consider non-local relationships between the pixels of an image. Released in 2017, and originally used for language.

2. Pixel Patch Encodings

Different resolution semantic and geometric image information encodings which allow for better representations of relationships between image areas than pixels are able to achieve given the same compute. Allows using Transformers on high resolution images.

3. CLIP

Contrastive Language-Image Pre-training. Before, the only way we knew to classify an image was as a "face" or "cat" or "ramen". When the genius idea of labeling images with semantically meaningful vectors rather than one-hot encoded classes was revealed, it changed everything in computer vision very quickly, and problems that used to be hard became trivial. Released in 2021.

4. Diffusion Models

GANs penalise you for making an image which does not seem to be part of an existing dataset. This encourages one to make the worst quality image that still looks like a member of that dataset. Diffusion learns to denoise an image. Removing noise is perceptually similar to increasing resolution, and people like images that look that way. People with better intuition about diffusion models may be able to add on why they're superior. I've read all the papers leading up to the latest unCLIP (Dalle2) but it's complicated. Released in 2020, with major improvements to the training process continuously being made since then.
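
To make the "learns to denoise" part concrete, here is a minimal sketch of the forward noising step used in DDPM-style diffusion training, in plain Python. The linear beta schedule, the step count, the toy 4-value "image", and all function names are illustrative assumptions, not the setup of any particular model mentioned in this thread.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps and return (x_t, eps)."""
    abar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps

def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and actual noise (the training signal)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
image = [0.5, -0.25, 1.0, 0.0]   # toy "image" with 4 values scaled to [-1, 1]
noisy, eps = add_noise(image, t=500, alpha_bars=alpha_bars, rng=rng)

# A real model would predict eps from (noisy, t); a perfect prediction has zero loss.
print(denoising_loss(eps, eps))
```

Training repeats this for random images and random steps t, so the network sees every noise level. Generation then runs the learned denoiser step by step, starting from pure noise.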

Hope this was helpful. All of the above were only implemented for images in any real way in the last three years. Many groups only put them all together this year, resulting in DallE, Stable Diffusion, and Imagen.

I'm working on doing this for 3D and later for use cases in AR. 3D generation still hasn't been cracked the same way image generation has, but the above will likely contribute to the solution to that as well. Anyone who's interested in working on that, feel free to message me.

> I've read all the papers leading up to the latest unCLIP (Dalle2) but it's complicated. Released in 2020, with major improvements to the training process continuously being made since then.

The models behind Imagen and StableDiffusion are actually simpler than DALLE2, and both are higher quality (SD of course isn’t always since it’s much smaller). That suggests DALLE3 will also be simpler again.

There’s also been very recent work with generalized diffusion models (that use problems other than noise removal and still work) and Google researchers have been tweeting results from a merged Imagen/Parti in the last few days.

Thanks for answering. Since you mentioned your work on text-to-3d, what are the ways to enhance the image/3d model to actually be photo-(or rather reality)-realistic? Even (presumably) hand-picked examples from Google on the linked page lack the support bars of the sunglasses and include floating cups of wine with a base-less Eiffel Tower in the background.

P.S. It seems raccoons are unimaginable (even for AI) with any sunglasses: if photo-realistic mode is selected for a raccoon, changing to "wearing a sunglasses and" makes no difference :)

I know as much about how to get the best image outputs from text inputs as the person who designed an airport knows the best place to eat in it. The emergent properties of the system are a result of the data put into it, so I can only discuss the system itself, not what it ended up doing with the data in that system.

The models are a product of their datasets, specifically the relationship of the images and prompts via CLIP. CLIP puts both images and text into the same coordinate space; imagine just a 2D graph. It tries to ensure that any real image and its caption will be each other's closest neighbors in that coordinate space.
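
As a rough illustration of that "closest neighbor" objective, here is a small sketch of a symmetric contrastive loss over cosine similarities, in plain Python. The 2D embeddings are made up by hand (real ones come from trained image and text encoders), and the temperature of 0.07 and all function names are assumptions for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every caption."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stabilised)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should pick caption i, and caption i should pick image i."""
    sims = similarity_matrix(image_embs, text_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Hand-made 2D embeddings, purely for illustration.
image_embs = [[1.0, 0.1], [0.1, 1.0]]     # e.g. a cat photo and a ramen photo
aligned_text = [[0.9, 0.2], [0.2, 0.9]]   # captions sitting near their own images
swapped_text = [[0.2, 0.9], [0.9, 0.2]]   # captions sitting near the wrong images

print(contrastive_loss(image_embs, aligned_text))   # small loss
print(contrastive_loss(image_embs, swapped_text))   # much larger loss
```

Training pushes the encoders toward the low-loss arrangement, which is why a caption ends up as the nearest neighbor of the image it describes.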

So if you want a certain image, you have to ask "what caption would be most likely and most uniquely given to the image I'm imagining".

I'm sure this advice is way less helpful than what you find in prompt engineering discord channels and guides I've seen.

Is 3d a different problem, or a similar one but considerably harder? I'd expect the data encoding (vertices vs pixels) to change a bit about it but I'm not familiar enough to know.
Pixel values are discrete (length x width x r256 x g256 x b256) and vertex values are continuous, so that is one major difference.

Secondly, there's vastly more labeled image data in the world than 3D data, so creating a CLMP (contrastive language and mesh pairing) model is harder.

It's very late but I may be able to give a much better answer on more of the nuances of 3D generation tomorrow.

Would voxels be easier than vertex-based meshes?

I can imagine you'd have the problem of stray floating voxels then, which isn't as noticeable when it happens with 2D pixels.

The “hot new thing” is NeRF, neural radiance fields, which can take into account the way light interacts with the object (and hence you can correlate data from pictures taken at different angles)
Interesting!

I knew about transformers, CLIP and diffusion, but pixel patch encodings are new to me.

Can you give me more details / point me towards an explainer? A quick duckduckgo search didn't help.

I don't quite remember whether it was first used in the ViT paper[1], but it's a fairly straightforward idea. You take the patches of an image like they are words in a sentence, reduce the size of the patch(num_of_pixel x num_of_pixel) with a linear projection so that we can actually process it and get rid of sparse pixel information, add in positional encodings to carry the location of the patch, and from that point on treat the patches the way language models treat words with transformers. Essentially, words are a human-constructed but information-dense representation of language, whereas images are quite sparse, because individual pixel values don't really change much of an image.

1: https://arxiv.org/pdf/2010.11929.pdf

> reduce the size of the patch(num_of_pixel x num_of_pixel) with a linear projection

What does that mean?

(Thanks for the explanation)

The flattened image patch of width and height PxP pixels gets multiplied with a learnable matrix of dimension P^2xD (P^2·CxD for an image with C color channels), where D is the size of the patch embedding. In other words, it’s a linear transformation that reduces the dimensionality of the image patch.
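
For anyone who prefers code, here is a minimal sketch of that projection in plain Python. It assumes a single-channel image whose sides are divisible by the patch size, uses toy sizes, and fills the "learnable" matrix with random values as a stand-in. Positional encodings are left out, and the function names are made up for the example.

```python
import random

def extract_patches(image, patch_size):
    """Split an H x W grayscale image (list of rows) into flattened P*P patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

def project(patch, weight):
    """Multiply a flattened patch (length P*P) by a (P*P) x D matrix."""
    dim = len(weight[0])
    return [sum(patch[k] * weight[k][d] for k in range(len(patch)))
            for d in range(dim)]

rng = random.Random(0)
P, D = 4, 3   # toy sizes; ViT-Base uses P=16 and D=768
image = [[rng.random() for _ in range(8)] for _ in range(8)]   # 8x8 toy image

# In a real model this matrix is learned; random values stand in for it here.
weight = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P)]

patches = extract_patches(image, P)
embeddings = [project(p, weight) for p in patches]
print(len(patches), len(patches[0]), len(embeddings[0]))   # 4 16 3
```

Each of the 4 patches goes from 16 numbers down to 3, and those short vectors are what the transformer then treats as "words".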
Googling the author list gives me a preprint dated May this year: https://arxiv.org/abs/2205.11487
It is now available to lay people by just typing into a website. Months ago it was rather „use this Jupyter notebook“. So people are now using it for more serious stuff.

For example, here is an RPG designer using Midjourney for illustrations: https://www.bastionland.com/2022/07/primeval-bastionland-pla...

A coworker and I were playing with DALL-E 2 yesterday, and I pointed out that while I don't think any major RPGs are going to be moving away from artists anytime soon, the quality of Ashcan Editions just jumped way, way up.
Not in the field, but maybe diffusion models? They seem to be used by a lot of different image generation techniques.
I don't know. This 'explosion' in the public space may just be the technology crossing a threshold where big $$$ are on the table and competitors rushing in to get hold of good chunks of the emerging market. No tech breakthrough, but a marketing & sales assault.