According to the paper, they first trained a teacher model on 1.5 million images with known depth maps (labels), then used the teacher to generate pseudolabels (inferred depth maps) for the full dataset. Finally, they trained a student model to recover those pseudolabels from distorted versions of the original images.
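
To make the three steps concrete, here is a toy sketch of the same teacher/student recipe. It is not the paper's code: the "images" are just numbers, both models are straight-line fits, and the distortion is Gaussian noise. All of the names (`fit_line`, `predict`, `distort`, `train_teacher_student`) and the noise scale are made up for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def predict(model, xs):
    """Apply a fitted (a, b) model to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


def distort(xs, rng, scale=0.1):
    """Stand-in for image distortion: add Gaussian noise to each input."""
    return [x + rng.gauss(0.0, scale) for x in xs]


def train_teacher_student(labeled_x, labeled_y, all_x, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on the small labeled set.
    teacher = fit_line(labeled_x, labeled_y)
    # Step 2: the teacher produces pseudolabels for the full dataset.
    pseudolabels = predict(teacher, all_x)
    # Step 3: the student sees distorted inputs but must still
    # recover the teacher's pseudolabels for the clean inputs.
    student = fit_line(distort(all_x, rng), pseudolabels)
    return teacher, student


if __name__ == "__main__":
    labeled_x = [0.0, 1.0, 2.0, 3.0, 4.0]
    labeled_y = [2.0 * x + 1.0 for x in labeled_x]  # toy ground truth
    all_x = [i * 0.05 for i in range(200)]          # mostly unlabeled
    teacher, student = train_teacher_student(labeled_x, labeled_y, all_x)
    print("teacher:", teacher)
    print("student:", student)
```

The thing to notice is that the student never sees a real label. It only sees the teacher's outputs, and it has to reproduce them from harder (distorted) inputs, which is what pushes it toward being robust to the distortion.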