Both of these approaches lose to a CPU. The state-of-the-art algorithm is Hashlife [1], which compresses both time and space, and can evaluate billions of generations on a grid of trillions of cells in milliseconds.
The FPGA approach is really efficient at what it does, but ultimately it doesn't scale. For one, it needs 1 bit per cell in the 2D torus, but FPGAs have kilobytes to low megabytes of on-chip memory. That makes it hard to simulate a 10,000 x 10,000 grid, let alone a 1,000,000 x 1,000,000 grid. For two, the FPGA explicitly calculates each iteration one by one. This is pretty fast at first, and it means you can use it to calculate a billion iterations in a few seconds or a trillion iterations in a few hours, but you can't scale past that.
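To put rough numbers on the memory point, here is a quick back-of-the-envelope sketch. It assumes exactly 1 bit per cell and a single copy of the grid; the helper name `grid_memory_bytes` is made up for illustration.

```python
def grid_memory_bytes(width: int, height: int) -> int:
    """Bytes needed to store a width x height grid at 1 bit per cell."""
    bits = width * height
    return (bits + 7) // 8  # round up to whole bytes


for side in (1_000, 10_000, 1_000_000):
    size = grid_memory_bytes(side, side)
    print(f"{side:>9,} x {side:<9,} -> {size / 1e6:,.3f} MB")
```

The 10,000 x 10,000 grid comes out to 12.5 MB and the 1,000,000 x 1,000,000 grid to 125 GB, which is why the first is already a stretch for on-chip memory and the second is out of reach.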
Hashlife can probably be sped up by GPUs a bit, but it processes a symbolic representation and consequently is quite well suited to CPUs. It spends a lot of its time doing hash table lookups (hence the name), which are not a good fit for GPUs and a terrible fit for FPGAs.
This reminds me of how I was fascinated by N-body simulations and fractals in high school, and then later found out there are much better ways of calculating both gravity and the Mandelbrot set than the obvious ones.
(i.e. tree methods like Barnes-Hut for gravity and perturbation theory for Mandelbrot)
GPU L0 cache latency is, IIUC, ~20x higher than a CPU's. In fact, in this case I think the GPU would have to use L2 cache, since the data is shared across so many cores, so now you're talking ~50x. So even if you get full parallelism of the cell computation, you can plug in the numbers and find it would be far slower than an FPGA (but still faster than a CPU).
I'm not an expert though. Maybe GPUs have some way of mitigating the high cache latencies.
The main trick GPUs use is having a massive number of hardware threads per actual core. If an instruction in one thread is stalled on a load operation, the core will just switch to another thread. As long as a core always has more runnable threads than the memory latency in cycles, there is always some thread ready to issue, and the latency no longer affects your throughput.
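The arithmetic behind that can be sketched with a toy model. It assumes a core that issues one instruction per cycle from any ready thread, and threads that each do `work_cycles` of compute and then stall for `latency_cycles` on a load. Both function names and the 400-cycle figure are illustrative assumptions, not measurements of any real GPU.

```python
import math


def threads_to_hide_latency(latency_cycles: int, work_cycles: int = 1) -> int:
    """Threads per core needed so that some thread is always ready to issue.

    While one thread waits latency_cycles on its load, the other threads
    must supply that many cycles of work between them.
    """
    return math.ceil((latency_cycles + work_cycles) / work_cycles)


def core_utilization(threads: int, latency_cycles: int, work_cycles: int = 1) -> float:
    """Fraction of cycles the core spends issuing, capped at 1.0."""
    return min(1.0, threads * work_cycles / (latency_cycles + work_cycles))


latency = 400  # illustrative memory latency in cycles (an assumption)
for work in (1, 8):
    needed = threads_to_hide_latency(latency, work)
    print(f"work={work}: need {needed} threads, "
          f"32 threads give {core_utilization(32, latency, work):.0%} utilization")
```

With one cycle of work per load you need more threads than the latency in cycles, as described above. The more arithmetic each thread does between loads, the fewer threads it takes.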
Considering that each cell of the next generation can be calculated independently and in parallel, I don't think a GPU implementation would be able to beat the FPGA. A GPU has many pipelines and can quickly process many "pixels" simultaneously, but it can only parallelize all of them at once for very small screen sizes.
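The per-cell independence is easy to see in a plain reference implementation. This is a minimal sketch in pure Python on a torus (edges wrap around), with the grid stored as a list of lists of 0/1. The names `step` and `next_state` are just for illustration.

```python
def step(grid):
    """Compute one Game of Life generation on a torus (edges wrap around)."""
    rows, cols = len(grid), len(grid[0])

    def next_state(r, c):
        # Reads only the previous generation, so every call is independent
        # of every other call and could run in parallel.
        neighbors = sum(
            grid[(r + dr) % rows][(c + dc) % cols]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
        )
        alive = neighbors == 3 or (grid[r][c] == 1 and neighbors == 2)
        return 1 if alive else 0

    return [[next_state(r, c) for c in range(cols)] for r in range(rows)]


# A horizontal blinker on a 5x5 torus.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in step(blinker):
    print("".join(".#"[cell] for cell in row))
```

Each `next_state` call looks at nine cells of the old grid and writes one cell of the new grid. Hardware with one such unit per cell can do the whole generation at once; hardware with fewer units than cells has to loop over the grid in batches.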
[1] https://en.wikipedia.org/wiki/Hashlife