Stable Diffusion on a 16 core AMD CPU takes for me about 2-3 hours to generate an image, just to give you a rough idea of the performance. (On the same AMD's iGPU it takes 2 minutes or so).
For example I pulled a (2GB I think, 4 tops) 6870 out of my desktop because it's a beast (in physical size, and power consumption) and I wasn't using it for gaming or anything, figured I'd be fine just with the Intel integrated graphics. But if I wanted to play around with some models locally, it'd be worth putting it back & figuring out how to use it as a secondary card?
One counterintuitive advantage of the integrated GPU is it has access to system RAM (instead of using a dedicated and fixed amount of VRAM). That means I'm able to give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when running. The system RAM is slower than VRAM which is the trade-off here.
2GB is really low. I've been able to use A111 stable diffusion on my old gaming laptop's 1060 (6GB VRAM) and it takes a little bit less than a minute to generate an image. You would probably need to try the --lowvram flag on startup.
SDXL Turbo is much better, albeit kinda fuzzy and distorted. I was able to get decent single-sample response times (~80-100s) from my 4 core ARM Ampere instance, good enough for a Discord bot with friends.
If that is true, then the CPU variant must be a much worse implementation of the algorithm than the GPU variant, because the true ratio of the GPU and CPU performances is many times less than that.
On my 5900X, so 12 cores, I was able to get SDXL to around 10-15 minutes. I did do a few things to get to that.
1. I used an AMD Zen optimised BLAS library. In particular the AMDBLIS one, although it wasn't that different to the Intel MKL one.
2. I preload the jemalloc library to get better aligned memory allocations.
3. I manually set the number of threads to 12.
This is the start of my ComfyUI CPU invocation script.
Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I was memory bandwidth limited I think, not compute.