Is 3D a different problem, or a similar one that's just considerably harder? I'd expect the data encoding (vertices vs. pixels) to change things a bit, but I'm not familiar enough to know.
Pixel values are discrete (a width x height grid where each of the R, G, and B channels takes one of 256 values) while vertex coordinates are continuous, so that is one major difference.
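To make the discrete vs. continuous point concrete, here's a rough sketch in Python with NumPy. The array names and the tiny 2x2 image / single-triangle mesh are made up for illustration, and it assumes a standard 8-bit RGB image.

```python
import numpy as np

# A tiny 2x2 RGB image: every channel is an integer in 0..255 (8-bit assumed).
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# A single triangle: vertices are real-valued 3D coordinates,
# and each face is a triple of indices into the vertex array.
vertices = np.array(
    [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.5, 0.8660254, 0.0]],
    dtype=np.float32,
)
faces = np.array([[0, 1, 2]], dtype=np.int64)

# One pixel can only take a finite number of states.
pixel_states = 256 ** 3

# A vertex can be nudged by an arbitrarily small amount and the mesh is still valid.
nudged = vertices.copy()
nudged[2, 0] += 1e-3

print(image.shape, image.dtype, pixel_states)
print(vertices.shape, vertices.dtype, faces.shape)
```

So a model that outputs an image can pick from a fixed set of values on a fixed grid, whereas one that outputs a mesh has to regress real-valued coordinates.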
Secondly, there's vastly more labeled image data in the world than 3D data, so creating a CLMP (contrastive language and mesh pairing) model is harder.
It's very late but I may be able to give a much better answer on more of the nuances of 3D generation tomorrow.
The “hot new thing” is NeRF (neural radiance fields), which can take into account the way light interacts with the object, and hence lets you correlate data from pictures taken at different angles.
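For a feel of how a radiance field turns into a pixel, here's a minimal sketch of the volume rendering step along one camera ray. Big assumptions: `toy_field` is a hypothetical hand-written stand-in for the trained network (a soft sphere whose color depends on the view direction), and `render_ray` plus its parameters are names I made up. A real NeRF learns the field from photos and samples rays more cleverly.

```python
import numpy as np

def toy_field(points, view_dir):
    """Hypothetical stand-in for a trained network: density and RGB per point."""
    dist = np.linalg.norm(points, axis=-1)
    sigma = np.where(dist < 0.5, 10.0, 0.0)      # dense inside a sphere of radius 0.5
    base = np.array([0.8, 0.3, 0.2])
    shade = 0.5 + 0.5 * abs(view_dir[2])         # color varies with viewing direction
    rgb = np.tile(base * shade, (points.shape[0], 1))
    return sigma, rgb

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Composite color along one ray using evenly spaced samples."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    sigma, rgb = toy_field(points, direction)

    delta = (far - near) / (n_samples - 1)       # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)         # opacity of each segment
    # Transmittance: fraction of light that survives up to each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha

    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights.sum()

# One ray through the sphere, one that misses it entirely.
hit_color, hit_opacity = render_ray([0.0, 0.0, -2.0], [0.0, 0.0, 1.0])
miss_color, miss_opacity = render_ray([2.0, 2.0, -2.0], [0.0, 0.0, 1.0])
print(hit_color, hit_opacity)
print(miss_color, miss_opacity)
```

Because the same field is queried from every camera position, photos taken at different angles all constrain the same underlying density and color, which is what lets the views be correlated.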