by weebull
862 days ago
WTF! On my 5900X (12 cores), I was able to get SDXL down to around 10-15 minutes. I did a few things to get there:

1. I used an AMD Zen-optimised BLAS library, specifically the AMD BLIS one, although it wasn't that different from the Intel MKL one.
2. I preload the jemalloc library to get better-aligned memory allocations.
3. I manually set the number of threads to 12.

This is the start of my ComfyUI CPU invocation script:

export OMP_NUM_THREADS=12
export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
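
If either library is missing, the dynamic loader prints an error for every process that inherits LD_PRELOAD, so it may be worth guarding the preload. The following is only a sketch, not part of the original script: `build_preload` is a hypothetical helper name, and the two paths are simply the ones from the exports above, which will likely differ on other installs.

```bash
#!/usr/bin/env bash
# Sketch: build LD_PRELOAD only from libraries that are actually present.

# Print a colon-separated list of the given paths that exist as readable files.
# Missing paths are reported on stderr and skipped.
build_preload() {
    local out="" lib
    for lib in "$@"; do
        if [ -f "$lib" ] && [ -r "$lib" ]; then
            out="${out:+$out:}$lib"
        else
            echo "warning: skipping missing library: $lib" >&2
        fi
    done
    printf '%s' "$out"
}

# Same order as the exports above: jemalloc first, then BLIS.
# These paths are assumptions; adjust them for your system.
preload="$(build_preload \
    /usr/lib/libjemalloc.so \
    /opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so)"

# Keep anything already in LD_PRELOAD, without leaving a stray colon.
if [ -n "$preload" ]; then
    export LD_PRELOAD="${preload}${LD_PRELOAD:+:$LD_PRELOAD}"
fi
echo "LD_PRELOAD=${LD_PRELOAD:-}"
```

The `${var:+...}` expansions avoid the trailing colon that a plain `$LD_PRELOAD` append leaves behind when the variable starts out empty.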
Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I think I was limited by memory bandwidth, not compute.
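
A quick way to check where thread scaling flattens out on a given machine is to time the same job at a few OMP_NUM_THREADS values. This is a rough sketch under assumptions: `sweep_threads` and `BENCH_CMD` are hypothetical names, and the workload would need to be a fixed-seed generation run so the timings are comparable. It defaults to `true` so the script runs as written.

```bash
#!/usr/bin/env bash
# Sketch: time one command at several OMP_NUM_THREADS settings.

# Usage: sweep_threads "<space-separated counts>" cmd [args...]
# Runs the command once per thread count and prints "<threads> <seconds>".
sweep_threads() {
    local counts="$1" n start end
    shift
    for n in $counts; do
        start=$(date +%s)
        # The assignment applies only to this one command invocation.
        OMP_NUM_THREADS="$n" "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "$n $((end - start))"
    done
}

# BENCH_CMD is a hypothetical variable holding the workload to time.
# It is left unquoted on purpose so it splits into command and arguments.
sweep_threads "4 8 12 16" ${BENCH_CMD:-true}
```

Whole-second resolution from `date +%s` is coarse, but it is enough for runs that take minutes. If the times stop improving well before the core count, that would be consistent with a memory bandwidth limit rather than a compute limit.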