Currently, I hit the limit on the maximum number of workgroups in a single dispatch submission (this is why the y and z axes are smaller than the x axis for now). This limit can be lifted by splitting the work across multiple dispatches, which I will do in one of the next updates. To go past 2^24, I need to polish the four-step FFT algorithm to allow for more than two data transfers, which I have implemented but not yet tested. There will also be a single-precision limit in this range, as the twiddle factor values will be close to 1e-8, which is close to the machine error.
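
The multiple-dispatch idea can be sketched roughly as follows. This is only an illustration of the chunking arithmetic, not the code used in this project: `split_dispatch` and `ASSUMED_LIMIT` are hypothetical names, and 65535 is used because it is the minimum per-axis workgroup count that Vulkan guarantees (a real device may report a larger `maxComputeWorkGroupCount`). Each (offset, count) pair would correspond to one dispatch, with the offset passed to the shader in some way, for example as a push constant.

```python
from typing import Iterator, Tuple


def split_dispatch(total_groups: int, max_groups: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, count) pairs that together cover total_groups workgroups.

    Every count is at most max_groups, so each pair fits in one dispatch.
    """
    if total_groups < 0 or max_groups <= 0:
        raise ValueError("total_groups must be >= 0 and max_groups must be > 0")
    offset = 0
    while offset < total_groups:
        # The last chunk may be smaller than the limit.
        count = min(max_groups, total_groups - offset)
        yield offset, count
        offset += count


# Assumption: the minimum per-axis limit guaranteed by Vulkan;
# query the actual device limit in real code.
ASSUMED_LIMIT = 65535

chunks = list(split_dispatch(200_000, ASSUMED_LIMIT))
for offset, count in chunks:
    print(f"dispatch {count} workgroups starting at workgroup {offset}")
```

With this approach the shader computes its global workgroup index as the offset plus its local workgroup ID, so the axis size is no longer bounded by the limit of a single dispatch.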