Definitely curious, this looks very similar to Coconut, even down to the CoT encoding process in Figure 2. They go into a lot more detail though, seems like parallel innovation.
I wonder whether even those models which emit thinking tokens in reality do most of the work within the latent space so the difference is only superficial