by darknoon 641 days ago
This is somewhat similar, but diffusion transformers typically use a pre-trained text model for the text conditioning, whereas in this case the text conditioning is integrated and trained together with the rest of the model multimodally.