| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Birch-san 881 days ago

regarding ControlNet: we have a UNet backbone, so the idea of "make trainable copies of the encoder blocks" sounds possible. the other part, "use a zero-inited dense layer to project the peer-encoder output and add it to the frozen-decoder output" also sounds fine. not quite sure what they do with the mid-block but I doubt there'd be any problem there.

regarding IPAdapter: I'm not familiar with it, but from the code it looks like they just run cross-attention again and sum the two attention outputs. feels a bit weird to me, because the attention probabilities add up to 2 instead of 1. and they scale the bonus attention output only instead of lerping. it'd make more sense to me to formulate it as a cross-cross attention (Q against cat([key0, key1]) and cat([val0, val1])), but maybe they wanted it to begin as a no-op at the start of training or something. anyway.. yes, all of that should work fine with HDiT. the paper doesn't implement cross-attention, but it can be added in the standard way (e.g. like stable-diffusion) or as self-cross attention (e.g. DeepFloyd IF or Imagen).

I'd recommend though to make use of HDiT's mapping network. in our attention blocks, the input gets AdaNormed against the condition from the mapping network. this is currently used to convey stuff like class conditions, Karras augmentation conditions and timestep embeddings. but it supports conditioning on custom (single-token) conditions of your choosing. so you could use this to condition on an image embed (this would give you the same image-conditioning control as IPAdapter but via a simpler mechanism).