methodology is a separate u-net per instrument type to predict a soft mask in spectrogram space (time x frequency), then they apply that mask to the input audio. fairly standard.