If I understand correctly, a big problem is that the calculation isn't embarrasingly parallel: the various chunks are not independent, so you need to do a lot of IO to get the results from step N from your neighbours to calculate step N+1.
Using more smaller nodes means your cross-node IO is going to explode. You might save money on your compute hardware, but I wouldn't be surprised if you'd end up with an even greater cost increase on the network hardware side.
I'm not sure where else you can get a half TB of 800GB/s memory for < $10k. (Though that's the M3 Ultra, don't know about the M5). Is there something competitive in the nvidia ecosystem?
I wasn't aware that M3 Ultra offered a half terabyte of unified memory, but an RTX5090 has double that bandwidth and that's before we even get into B200 (~8TB/s).
You could get x1 M3 Ultra w/ 512gb of unified ram for the price of x2 RTX 5090 totaling 64gb of vram not including the cost of a rig capable of utilizing x2 RTX 5090.