Hacker News new | ask | show | jobs
by Birch-san 872 days ago
I'm one of the authors; happy to answer questions. this arch is of course nice for high-resolution synthesis, but there's some other cool stuff worth mentioning..

activations are small! so you can enjoy bigger batch sizes. this is due to the 4x patching we do on the ingress to the model, and the effectiveness of neighbourhood attention in joining patches at the seams.

the model's inductive biases are pretty different than (for example) a convolutional UNet's. the innermost levels seem to train easily, so images can have good global coherence early in training.

there's no convolutions! so you don't need to worry about artifacts stemming from convolution padding, or having canvas edge padding artifacts leak an implicit position bias.

we can finally see what high-resolution diffusion outputs look like _without_ latents! personally I think current latent VAEs don't _really_ achieve the high resolutions they claim (otherwise fine details like text would survive a VAE roundtrip faithfully); it's common to see latent diffusion outputs with smudgy skin or blurry fur. what I'd like to see in the future of latent diffusion is to listen to the Emu paper and use more channels, or a less ambitious upsample.

it's a transformer! so we can try applying to it everything we know about transformers, like sigma reparameterisation or multimodality. some tricks like masked training will require extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN), but we're very happy with its featureset and performance so far.

but honestly I'm most excited about the efficiency. there's too little work on making pretraining possible at GPU-poor scale. so I was very happy to see HDiT could succeed at small-scale tasks within the resources I had at home (you can get nice oxford flowers samples at 256x256px with half an hour on a 4090). I think with models that are better fits for the problem, perhaps we can get good results with smaller models. and I'd like to see big tech go that direction too!

-Alex Birch

4 comments

Hi Alex Amazing work. I scanned the paper and dusted off my aging memories of Jeremy Howard’s course. Will your model live happily alongside the existing SD infrastructure such as ControlNet, IPAdapter, and the like? Obviously we will have to retrain these to fit onto your model, but conceptually, does your model have natural places where adapters of various kinds can be attached?
regarding ControlNet: we have a UNet backbone, so the idea of "make trainable copies of the encoder blocks" sounds possible. the other part, "use a zero-inited dense layer to project the peer-encoder output and add it to the frozen-decoder output" also sounds fine. not quite sure what they do with the mid-block but I doubt there'd be any problem there.

regarding IPAdapter: I'm not familiar with it, but from the code it looks like they just run cross-attention again and sum the two attention outputs. feels a bit weird to me, because the attention probabilities add up to 2 instead of 1. and they scale the bonus attention output only instead of lerping. it'd make more sense to me to formulate it as a cross-cross attention (Q against cat([key0, key1]) and cat([val0, val1])), but maybe they wanted it to begin as a no-op at the start of training or something. anyway.. yes, all of that should work fine with HDiT. the paper doesn't implement cross-attention, but it can be added in the standard way (e.g. like stable-diffusion) or as self-cross attention (e.g. DeepFloyd IF or Imagen).

I'd recommend though to make use of HDiT's mapping network. in our attention blocks, the input gets AdaNormed against the condition from the mapping network. this is currently used to convey stuff like class conditions, Karras augmentation conditions and timestep embeddings. but it supports conditioning on custom (single-token) conditions of your choosing. so you could use this to condition on an image embed (this would give you the same image-conditioning control as IPAdapter but via a simpler mechanism).

IPAdapter, I am curious if there are useful GUIs for this? Creating image masks through uploading to colab is not so cute.
Here's one example: https://github.com/Acly/krita-ai-diffusion/

But generally, most other UIs support it. It has serious limitations though, for example it center-crops the input to 224x224px. (which is enough for a surprisingly large amount of uses, but not enough for many others)

Yes. I discussed this issue with the author of the ComfyUI IP-Adapter nodes. It would doubtless be handy if someone could end-to-end train a higher resolution IP-Adapter model that integrated its own variant of CLIPVision that is not subject to the 224px constraint. I have no idea what kind of horsepower would be required for that.

A latent space CLIPVision model would be cool too. Presumably you could leverage the semantic richness of the latent space to efficiently train a more powerful CLIPVision. I don’t know whether anyone has tried this. Maybe there is a good reason for that.

I appreciate the restraint of showing the speedup on a log-scale chart rather than trying to show a 99% speed up any other way.

I see your headline speed comparison is to "Pixel-space DiT-B/4" - but how does your model compare to the likes of SDXL? I gather they spent $$$$$$ on training etc, so I'd understand if direct comparisons don't make sense.

And do you have any results on things that are traditionally challenging for generative AI, like clocks and mirrors?

Alex - I run Invoke (one of the popular OSS SD UIs for pros)

Thanks for your work - it’s been impactful since the early days of the project.

Excited to see where we get to this year.

ah, originally lstein/stable-diffusion? yeah that was an important fork for us Mac users in the early days. I have to confess I've still never used a UI. :)

this year I'm hoping for efficiency and small models! even if it's proprietary. if our work can reduce some energy usage behind closed doors that'd still be a good outcome.

Yes, indeed. Lincoln's still an active maintainer.

Energy efficiency is key - Especially with some of these extremely inefficient (wasteful, even) features like real-time canvas.

Good luck - Let us know if/how we can help.

Did you do any inpainting experiments? I can imagine a pixel-space diffusion model to be better at it than one with a latent auto-encoder.
Not yet, we focused on the architecture for this paper. I totally agree with you though - pixel space is generally less limiting than a latent space for diffusion, so we would expect good performance inpainting behavior and other editing tasks.