by Jack000
1087 days ago
Not seeing anything about the dataset. Are they still using LAION? There's no mention of LAION in the paper, and the results look quite different from 1.5, so I'm guessing no.

> the model may encounter challenges when synthesizing intricate structures, such as human hands

I think there are two main reasons for poor hands/text:

- Humans care about certain areas of the image more than others, giving high saliency to faces, hands, body shape etc. and lower saliency to backgrounds and textures. Due to the way the U-Net is trained, it cares about all areas of the image equally. This means model capacity per area is uniform, which leads to capacity problems for objects that have a large number of configurations and that humans care more about.

- The sampling procedure implicitly assumes a uniform amount of variance over the entire image. Text glyphs basically never change, which means we should basically have an infinite CFG (classifier-free guidance) scale in the parts of the image that contain text.

I'm not sure if there's any point in working on this though, since both can be fixed by simply making a bigger model.
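To make the two points concrete, here's a rough NumPy sketch of what "non-uniform" versions could look like. This is only an illustration of the idea, not anything from the paper: the function names `saliency_weighted_mse` and `spatial_cfg` are made up, and the saliency map and per-pixel guidance map are assumed to come from somewhere else (e.g. a face/hand/text detector).

```python
import numpy as np


def saliency_weighted_mse(pred_noise, true_noise, saliency):
    """Noise-prediction MSE with a per-pixel weight (hypothetical helper).

    pred_noise, true_noise: arrays of shape (C, H, W).
    saliency: non-negative array of shape (H, W) with a positive mean
              (assumed input, e.g. higher on hands and faces).
    """
    # Normalise so the average weight is 1; a flat map then gives plain MSE.
    weights = saliency / saliency.mean()
    sq_err = (pred_noise - true_noise) ** 2
    # Broadcast the (H, W) weights over the channel axis.
    return float((sq_err * weights[None, :, :]).mean())


def spatial_cfg(eps_uncond, eps_cond, scale_map):
    """Classifier-free guidance with a per-pixel scale (hypothetical helper).

    Standard CFG uses one scalar s:
        eps = eps_uncond + s * (eps_cond - eps_uncond)
    Here s is replaced by an (H, W) map, so regions such as text could
    get a much larger scale than the background.
    """
    return eps_uncond + scale_map[None, :, :] * (eps_cond - eps_uncond)
```

With a flat saliency map the loss is the usual MSE, and with a constant scale map the guidance is ordinary CFG, so both are strict generalisations of what's done now.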