|
This linear scaling per-core doesn't match my experience with using the AMX co-processor. In my experience, on the M1 and the M1 Pro, there are a limited number of AMX co-processors that is independent of the number of cores within the M1 chip. I wrote an SO answer exploring some of the performance implications of this [0], and since I wrote that another more knowledgable poster has added more information [1]. One of the key takeaways is that appears to be one AMX coprocessor per "complex", leading us to hypothesize that the M1 Pro contains 2 AMX co-processors. This is supported by taking the code found in the gist [2] linked from my SO answer and running it on my M1 Pro. Compiling it, we get `dgesv_accelerate` which uses Accelerate to solve a medium-size linear algebra problem, that typically takes ~8s to finish on my M1 Pro. While running, `htop` reports that the process is pegging two cores (analogous to the result in my original SO answer on the M1 pegging one core; this supports the idea that the M1 Pro contains two AMX co-processors). If we run two `dgesv_accelerate` processes in parallel, we see that they take ~15 seconds to finish. So there is some small speedup, but it's very small. And if we run four processes in parallel, we see that they take ~32 seconds to finish. All in all, the kind of linear scaling shown in the article doesn't map well to the limited number of AMX co-processors available in Apple hardware, as we would expect the M1 Max to contain maybe 8 co-processors at most. This means we should see parallelism step up in 8 steps, rather than 20 steps as was shown in the graph. Everything I just said is true assuming that a single processor, running well-optimized code can completely saturate an AMX co-processor. That is consistent with the tests that I've run, and I'm assuming that the CFD solver he's running is well-written and making good use of the hardware (it does seem to be doing so from the shape of his graphs!). If this were not the case, one could argue that increasing the number of threads could allow multiple threads to more effectively share the underlying AMX coprocessor and we could get the kind of scaling seen in the article. However, in my experiments, I have found that Accelerate very nicely saturates the AMX resources and there is none left over for future sharing (as shown in the dgesv examples). Finally, as a last note on performance, we have found that using OpenBLAS to run numerical workloads directly on the Performance cores (and not using the AMX instructions at all) is competitive on larger linear algebra workloads. So it's not too crazy to assume that these results are independent of the AMX's abilities! [0] https://stackoverflow.com/a/67590869/230778
[1] https://stackoverflow.com/a/69459361/230778
[2] https://gist.github.com/staticfloat/2ca67593a92f77b1568c03ea... |