by weebull 862 days ago
WTF!

On my 5900X (12 cores), I was able to get SDXL down to around 10-15 minutes. I did a few things to get there.

1. I used an AMD Zen optimised BLAS library. In particular the AMD BLIS one, although it wasn't that different to the Intel MKL one.

2. I preloaded the jemalloc library to get better-aligned memory allocations.

3. I manually set the number of threads to 12.

This is the start of my ComfyUI CPU invocation script.

    export OMP_NUM_THREADS=12
    export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
    export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
    export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"

Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I think I was limited by memory bandwidth, not compute.
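
If you copy the preload lines from the script above, a more defensive variant might look like the sketch below. It only prepends a library when the file actually exists, and it avoids leaving a stray colon in LD_PRELOAD when the variable starts out empty. The library paths are the ones from my machine and will likely differ on yours, and prepend_preload is just a helper name made up for this example.

```shell
#!/bin/sh
# Sketch only: prepend_preload is a hypothetical helper, not part of any tool.
# It prepends a library to LD_PRELOAD only if the file exists.
prepend_preload() {
    if [ -f "$1" ]; then
        # ${LD_PRELOAD:+:$LD_PRELOAD} adds the ":" separator only when
        # LD_PRELOAD is already non-empty.
        LD_PRELOAD="$1${LD_PRELOAD:+:$LD_PRELOAD}"
    else
        echo "skipping missing library: $1" >&2
    fi
}

# Paths taken from the script above; adjust them for your install.
prepend_preload /opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so
prepend_preload /usr/lib/libjemalloc.so
export LD_PRELOAD

# Keep a thread count already set by the caller, otherwise default to 12.
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-12}"
```

The last line also makes it easy to try other thread counts from the command line without editing the script.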