| HN Mirror

1. Yes, the core of the algorithm is the modular squaring. The squaring is similar to a multiplication, of course. In general, the fast multiplication is implemented via convolution, via FFTs which results in a N x log(N) time complexity of the multiplication.

What we do is modular squaring iterations:

x := x^2 mod M,

where M== 2^p - 1, i.e. M is the Mersenne number that we test.

Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform", IBDWT [1] introduced by Crandall/Fagin to turn this into a convolution of a convenient size N "words", where each word contains about 18bits, so Words ~= p/18. For example our prime of interest M52 was tested with a FFT of size 7.5M == 1024 * 15 * 512.

The FFT is a double precision (FP64) floating point FFT. Depending on the FFT size we can make use of about 18bits per FFT "word", where a "word" corresponds to one FP64 value.

Some tricks involved up to this point are: one FFT size halving and the modular reduction for free because of IBDWT. Another FFT size halving because turning the real input/output values into complex numbers in the FFT.

The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits the FFT of size N=A*B into sub-FFTs of size A, one matrix multiplication with about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used: 7.5M == 1024 * 15 * 512.

We implement in a workgroup one FFT of size 1024 or 512. These are usually base-4 FFTs, with transpositions using LDS (Local Data Share, local per-workgroup memory in OpenCL).

The convolution is formed of:

   - forward FFT
   - element-wise multiplication
   - inverse FFT

After the inverse FFT, we also need to do Carry propagation which properly turns the convolution into a multi-word multiplication.

For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data does not need to transit through "global memory" (VRAM) anymore but stays local to the workgroup, which is a large gain.

So, to recap:

   - multiplication via convolution
   - convolution via FP64 FFT, achieving about 18bits per FP64 word
   - modular reduction for free through IBDWT

[1] https://en.wikipedia.org/wiki/Irrational_base_discrete_weigh...*