| TLDR : 1B particles ~ 3s per iterations For examples like particle simulations, on a single node with a 4090 GPU everything running on GPU without memory transfer to the CPU: -The main bottleneck is memory usage : available 24GB, Storing the particles 3 position coordinates, + 3 velocity coordinates, 4 bytes by number (float32) = Max 1B particles -Then GPU memory bandwidth : if everything is on the GPU you get between 1000GB/s of global memory access and 10000GB/s when shared memory caches are hit. The number of memory access is roughly proportional to the number of effective collisions between your particles which is proportional to the number of particles so around 12-30 times ( see optimal sphere packing number of neighbors in 3d, and multiply by your overlap factor). All in all for 1B particles, you can collision them all and move them in 1 to 10s. If you have to transfer things to the CPU, you become limited by the PCI-express 4.0 bandwidth of 16GB/s. So you can at most move 1B particles to and from the GPU, 0.7 times per second. Then if you want to store the particle on disk, instead of RAM because your system is bigger, then you can either use a M2 ssd (but you will burn them quickly) which has a theoretical bandwidth of 20GB/s so not a bottleneck, or use a network storage over 100Gb/s (= 12.5GB/s) ethernet, via two interfaces to your parameter server which can be as big as you can afford. So to summarize so far : 1B particles takes 1 to 10s per iteration per GPU. If you want to do smarter integration schemes like Rk4, you divide by 6. If you need 64 bits precisions you divide by 2. If you only need 16bits precisions you can multiply by 2. The number of particle you need : Volume of the box / h^3 with h the diameter of the particle = finest details you want to be able to resolve. If you use an adaptive scheme most of your particles are close to the surface of objects so O( surface of objects / h^2 ) with h=average resolution of the surface of the mesh. But adaptive scheme is 10 times slower. The precision of the approximation can be bounded by Taylor formula. SPH is typically order 2, but has issues with boundaries, so to represent a sharp boundary the h must be small. If you want higher order and sharp boundaries, you can do Finite Element Method, instead. But you'll need to tessellate the space with things like Delaunay/Voronoi, and update them as they move. |