| HN Mirror



	by rwmj 856 days ago
	Stable Diffusion on a 16-core AMD CPU takes about 2-3 hours to generate an image for me, just to give you a rough idea of the performance. (On the same chip's iGPU it takes 2 minutes or so.)

5 comments

weebull 855 days ago

WTF!

On my 5900X, so 12 cores, I was able to get SDXL to around 10-15 minutes. I did do a few things to get to that.

1. I used an AMD Zen-optimised BLAS library, specifically AMD BLIS, although it wasn't that different from Intel MKL.

2. I preload the jemalloc library to get better aligned memory allocations.

3. I manually set the number of threads to 12.

This is the start of my ComfyUI CPU invocation script.

    export OMP_NUM_THREADS=12
    export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
    export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
    export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"

Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I think I was limited by memory bandwidth, not compute.
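
A hedged sketch of how the preload lines above could be made a bit more defensive: a library is only added to LD_PRELOAD if the file actually exists, so a missing path does not trigger loader warnings on every command. The helper name `prepend_if_present` is made up for illustration; the two library paths are the ones from the script above and will likely differ on other systems.

```shell
#!/bin/sh
# Hypothetical helper (not part of ComfyUI, AOCL or jemalloc):
# print "lib:current" if lib exists, otherwise print current unchanged.
prepend_if_present() {
    lib="$1"
    current="$2"
    if [ -f "$lib" ]; then
        # ${current:+:$current} avoids a dangling ':' when current is empty.
        printf '%s\n' "$lib${current:+:$current}"
    else
        echo "skipping missing library: $lib" >&2
        printf '%s\n' "$current"
    fi
}

export OMP_NUM_THREADS=12
LD_PRELOAD=$(prepend_if_present /opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so "${LD_PRELOAD:-}")
LD_PRELOAD=$(prepend_if_present /usr/lib/libjemalloc.so "${LD_PRELOAD:-}")
export LD_PRELOAD
```

Returning the new value instead of exporting inside the function keeps the helper easy to check by hand before anything is actually preloaded.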


OJFord 856 days ago

Even older GPUs are worth using then I take it?

For example I pulled a (2GB I think, 4 tops) 6870 out of my desktop because it's a beast (in physical size, and power consumption) and I wasn't using it for gaming or anything, figured I'd be fine just with the Intel integrated graphics. But if I wanted to play around with some models locally, it'd be worth putting it back & figuring out how to use it as a secondary card?


rwmj 856 days ago

One counterintuitive advantage of the integrated GPU is that it has access to system RAM (instead of using a dedicated and fixed amount of VRAM). That means I'm able to give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when running. The trade-off is that system RAM is slower than VRAM.
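
A rough sketch for checking how much memory an iGPU can address, assuming Linux with the amdgpu driver (the comment above does not say which OS or driver was used). The amdgpu driver exposes `mem_info_vram_total` and `mem_info_gtt_total` in sysfs; on an iGPU the GTT figure is the system RAM the GPU may map. The helper name `bytes_to_mib` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count to whole MiB (rounded down).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# card[0-9] matches the GPU nodes only, not connector entries such as card0-DP-1.
for dev in /sys/class/drm/card[0-9]/device; do
    for f in mem_info_vram_total mem_info_gtt_total; do
        # These files only exist for amdgpu devices; skip anything else.
        if [ -r "$dev/$f" ]; then
            echo "$dev $f: $(bytes_to_mib "$(cat "$dev/$f")") MiB"
        fi
    done
done
```

On a machine without an amdgpu device the loop simply prints nothing.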


OJFord 856 days ago

Yeah I did wonder about that as I typed, which is why I mentioned the low amount (by modern standards anyway) on the card. OK, thanks!


purpleflame1257 856 days ago

2GB is really low. I've been able to use A1111 Stable Diffusion on my old gaming laptop's 1060 (6GB VRAM) and it takes a little bit less than a minute to generate an image. You would probably need to try the --lowvram flag on startup.
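
For reference, a minimal sketch of passing that flag, assuming the AUTOMATIC1111 web UI on Linux, where `webui-user.sh` is read by the launcher and `COMMANDLINE_ARGS` holds extra flags. Whether it is enough for a 2GB card is not something this thread establishes.

```shell
# In webui-user.sh (AUTOMATIC1111 web UI), then start with ./webui.sh
# --lowvram trades generation speed for lower VRAM use.
export COMMANDLINE_ARGS="--lowvram"
```

The same flag can also be given directly on the command line, e.g. `./webui.sh --lowvram`.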


mat0 856 days ago

No, I don't think so. I think you would need more VRAM to start with.


smoldesu 856 days ago

SDXL Turbo is much better, albeit kinda fuzzy and distorted. I was able to get decent single-sample response times (~80-100s) from my 4 core ARM Ampere instance, good enough for a Discord bot with friends.


emadm 856 days ago

SD Turbo runs nicely on an M2 MacBook Air (as does Stable LM 2!)

Much faster models will come


antman 855 days ago

Which AMD CPU/iGPU are these timings for?


rwmj 855 days ago

AMD Ryzen 9 7950X 16-Core Processor

The iGPU is gfx1036 (RDNA 2).


adrian_b 856 days ago

If that is true, then the CPU variant must be a much worse implementation of the algorithm than the GPU variant, because the true ratio of GPU to CPU performance is many times smaller than that.
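
To put a number on "that": using only the timings quoted at the top of the thread (2-3 hours on the CPU, about 2 minutes on the iGPU), the implied ratio works out as below. The function name `speedup` is just for illustration.

```shell
#!/bin/sh
# Hypothetical helper: integer ratio of CPU minutes to iGPU minutes.
speedup() {
    echo $(( $1 / $2 ))
}

# 2 hours = 120 min, 3 hours = 180 min, iGPU = 2 min (figures from the top post).
echo "implied speedup: $(speedup 120 2)x to $(speedup 180 2)x"
# prints: implied speedup: 60x to 90x
```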
