by devadvance
1106 days ago
From the paper:

> In this work, we present the first text-to-image diffusion model that generates an image on mobile devices in less than 2 seconds. To achieve this, we mainly focus on improving the slow inference speed of the UNet and reducing the number of necessary denoising steps.

As a layman, it's impressive and surprising that there's so much room for optimization here, given the number of hands-on folks in the OSS space.

> We propose a novel evolving training framework to obtain an efficient UNet that performs better than the original Stable Diffusion v1.5 while being significantly faster. We also introduce a data distillation pipeline to compress and accelerate the image decoder.

Pretty impressive.
There are only so many folks in the OSS space who are capable of doing work from this angle. There are more who could be micro-optimizing code, but most of them end up developing GUIs, app prototypes, and ad-hoc Python scripts that use the models.
At the same time, the whole field moves at a ridiculously fast pace. There's room for optimization because new model generations are released pretty much as fast as they're developed and trained, without anyone stopping to tune or optimize them.
Also, there must be room for optimization given how ridiculously compute-expensive training and inference still are. Part of my intuition here is that current models do roughly similar things to what our brains do, and brains manage to do these things fast on some 20-50 watts. Sure, there are a lot of differences between NN models and biological brains, but to a first approximation, this is a good lower bound.