As for doing it in general, it's a fairly standard vision transformer, so anything built on DINOv2 (or any other ViT) should be easy to adapt to DINOv3.
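
To make that concrete, here's a rough sketch of the part that usually needs touching when swapping backbones: the patch size and the number of non-patch tokens (CLS plus registers) at the front of the output sequence. `BackboneSpec`, `patch_grid` and `split_tokens` are hypothetical names I made up for illustration, not part of any library. The numbers are assumptions too (patch size 14 with a single CLS token for plain DINOv2, patch size 16 with CLS plus 4 register tokens for DINOv3, prefix tokens first), so check them against the config of whatever checkpoint you load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneSpec:
    """Hypothetical description of a ViT backbone's token layout."""
    patch_size: int         # side length of one square patch, in pixels
    num_prefix_tokens: int  # CLS plus any register tokens, assumed to come before the patch tokens


# Assumed values, not read from any checkpoint; verify against the model config.
DINOV2 = BackboneSpec(patch_size=14, num_prefix_tokens=1)  # CLS only
DINOV3 = BackboneSpec(patch_size=16, num_prefix_tokens=5)  # CLS + 4 registers


def patch_grid(spec, height, width):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % spec.patch_size or width % spec.patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // spec.patch_size, width // spec.patch_size


def split_tokens(spec, tokens, height, width):
    """Split a flat token sequence into prefix tokens and a rows x cols patch grid."""
    rows, cols = patch_grid(spec, height, width)
    prefix = tokens[:spec.num_prefix_tokens]
    patches = tokens[spec.num_prefix_tokens:]
    if len(patches) != rows * cols:
        raise ValueError("token count does not match the expected patch grid")
    grid = [patches[r * cols:(r + 1) * cols] for r in range(rows)]
    return prefix, grid
```

If the downstream code only ever goes through something like `split_tokens`, switching backbones mostly comes down to passing a different spec, plus resizing inputs to a multiple of the new patch size.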