What are your thoughts on PESTO, which learns pitch prediction very well with a small network and uses a self-supervised objective?
https://arxiv.org/abs/2309.02265
https://github.com/SonyCSLParis/pesto