Y
Hacker News
new
|
ask
|
show
|
jobs
by
darknoon
641 days ago
this is somewhat similar, but diffusion transformers typically use a pre-trained text model as the text conditioning whereas, in this case it's integrated and trained together multimodally.