| HN Mirror



	by darknoon 688 days ago
	this is somewhat similar, but diffusion transformers typically use a separate pre-trained text model for text conditioning, whereas in this case the text side is integrated into the model and trained jointly with the other modalities.
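
	A rough sketch of the distinction, in case it helps. Everything here is hypothetical: the `Module` and `Model` classes and the module names are made up for illustration, and it assumes the pre-trained text encoder stays frozen during training, which is common but not universal. It only shows which parts would receive gradient updates in each setup; it is not a real architecture.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a block of model parameters."""
    name: str
    trainable: bool = True


@dataclass
class Model:
    modules: list

    def trainable_names(self):
        # Names of the modules that would be updated during training.
        return [m.name for m in self.modules if m.trainable]


def conditioned_diffusion_transformer():
    # Assumption: the text encoder is pre-trained separately and frozen;
    # only the diffusion transformer that consumes its output is trained.
    return Model([
        Module("text_encoder", trainable=False),
        Module("diffusion_transformer"),
    ])


def joint_multimodal_model():
    # Text and other modalities share one model, so there is no separate
    # frozen text component: everything is trained together.
    return Model([Module("shared_multimodal_transformer")])


if __name__ == "__main__":
    print(conditioned_diffusion_transformer().trainable_names())
    print(joint_multimodal_model().trainable_names())
```

	In the first setup the text representation is fixed before training starts; in the second it is learned alongside everything else.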