Hacker News new | ask | show | jobs
by mrtksn 1298 days ago
With the full 50 iterations it appears to be about 30s on M1.

They have some benchmarks on the github repo: https://github.com/apple/ml-stable-diffusion

For reference, previously I was getting about <3 minutes for 50 iterations on my Macbook Air M1. I haven't yet tried Apple's implementation but it looks like a huge improvement. It might take it from "possible" to "usable".

4 comments

Yeah, it is just PyTorch MPS backend is not fully baked and have some slowness. You should be able to get close to that number with maple-diffusion (probably 10% slower) or my app: https://drawthings.ai/ (probably around 20% slower, but it supports samplers that takes less steps (50 -> 30)).
For comparison, it's also taking ~3min @ 50 iterations on my 12c Threadripper using OpenVino. It sounds like the improvements bring the M1 performance roughly in line with a GTX 1080.
The Apple Neural Engine in the m1 is supposed to be able to perform 11 tops. The GTX 1080 about 9-11 tflops.

So sounds plausible that the m1 can reach the same level in some use cases with the right optimizations.

I have Macbook Air M1, which is passively cooled. When cooled properly, that is thermal pad mod combined with a fan under the laptop, I'm getting closer to 2min - something like 2.8s per iteration. I guess it would be something 140s for 50 iterations on a MacBook Pro or Mac mini for M1.
This is accurate re: M1 Mac Mini times IME
Not SD2.0 but SD1.5, I am getting 30 iterations in 10 seconds on 1080ti. 50 iterations 18 seconds. 100%|| 30/30 [00:10<00:00, 2.84it/s]
How do dreamstudio/craiyon/hugging face manage to do seemingly quicker on their interfaces? Are they hosting these models on super beefy and costly GPUs for free?
M1's single-threaded CPU performance and power efficiency are exceptional; however M1's GPU performance is nothing special compared to normal discrete GPUs. You don't need something super beefy to beat M1 on the GPU side.

But also yes, it's gotta be expensive to host these models and I'm not sure where all these subsidies are coming from. I expect that we'll eventually see these things transition to more paid services.

For a low-power SoC, the GPU performance is actually pretty impressive. We recently did some transformer benchmarks and the inference performance of the M1 Max is almost half that of an RTX3090:

https://explosion.ai/blog/metal-performance-shaders

However the SoC only uses 31W when posting that performance.

Haven't tried this yet, but sounds slower than SD itself if you use one of the alt builds that supports mps where it had been cuda.

Mac Studio with M1 Ultra gets 3.3 iters/sec for me.

MacBook Pro M1 Max gets 2.8 iters/sec for me.

You’re talking about the higher end SKUs with many more GPU cores though and significantly more RAM (I think the lowest you can get is 32GB vs the 8 on their chip)