From a quick glance, this appears to be because the implementation pushes the entire graph (all edges) to the GPU. Sampling edges during training could alleviate this.
Indeed, TensorFlow tends to place everything on the GPU by default, whereas many PyTorch deep learning implementations encourage feeding data from the CPU to the GPU as needed with a DataLoader.
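To make the DataLoader idea concrete, here is a minimal sketch of the pattern, not the actual Parametric UMAP code. The toy graph, the `heads`/`tails` tensors and the batch size are all made up for illustration; only `TensorDataset`, `DataLoader` and `.to(device)` are real PyTorch API. The edge list stays in host memory and only one batch of edges is copied to the device per step.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy graph: 1,000 directed edges between 100 nodes.
num_nodes, num_edges = 100, 1000
generator = torch.Generator().manual_seed(0)
heads = torch.randint(0, num_nodes, (num_edges,), generator=generator)
tails = torch.randint(0, num_nodes, (num_edges,), generator=generator)

# The full edge list stays in host (CPU) memory.
edge_dataset = TensorDataset(heads, tails)
edge_loader = DataLoader(edge_dataset, batch_size=128, shuffle=True)

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_sizes = []
for head_batch, tail_batch in edge_loader:
    # Only the current batch of edges is copied to the device.
    head_batch = head_batch.to(device)
    tail_batch = tail_batch.to(device)
    batch_sizes.append(head_batch.shape[0])
    # ... embed the nodes of this batch and compute the loss here ...
```

With this layout, GPU memory use for the edges scales with the batch size rather than with the total number of edges. Uniform shuffling is an assumption of the sketch; a real port would probably want to sample edges according to their weights.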
There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580), but none are as good.