Short answer: yes! Previous work has shown that controlling DDSP models from MIDI input can give very good results. The solutions I am familiar with use a two-stage approach: the first stage turns MIDI into control signals (pitch and loudness contours, etc.), and the second stage turns those control signals into audio (like the particular model I discuss in the blog post) [1][2][3]. I actually think the first stage could also benefit from the transfer learning techniques we discuss in the blog post.
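To make the two-stage idea concrete, here is a toy sketch in Python with NumPy. It is not the model from the blog post or from any of the linked papers: both stages are hand-written stand-ins for what would be learned networks, and the function names, sample rate, frame rate and harmonic count are all assumptions I picked for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000  # audio samples per second (assumed for this sketch)
FRAME_RATE = 250     # control frames per second (assumed for this sketch)


def midi_to_hz(pitch):
    """Standard MIDI pitch to frequency conversion (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


def notes_to_controls(notes, duration_s):
    """Stage 1 stand-in: (pitch, velocity, start_s, end_s) notes -> control signals.

    A learned model would predict expressive pitch and loudness contours here;
    this hypothetical version just draws flat segments per note.
    """
    n_frames = int(round(duration_s * FRAME_RATE))
    f0 = np.zeros(n_frames)        # fundamental frequency in Hz, 0 = silence
    loudness = np.zeros(n_frames)  # linear loudness in the range 0..1
    for pitch, velocity, start_s, end_s in notes:
        a = int(round(start_s * FRAME_RATE))
        b = int(round(end_s * FRAME_RATE))
        f0[a:b] = midi_to_hz(pitch)
        loudness[a:b] = velocity / 127.0
    return f0, loudness


def controls_to_audio(f0, loudness, n_harmonics=8):
    """Stage 2 stand-in: control signals -> audio via a fixed additive synth.

    In a DDSP-style model a network would predict the synthesizer parameters;
    here the harmonic amplitudes are simply fixed at 1/k.
    """
    hop = SAMPLE_RATE // FRAME_RATE           # audio samples per control frame
    f0_audio = np.repeat(f0, hop)             # upsample controls to audio rate
    amp_audio = np.repeat(loudness, hop)
    phase = 2.0 * np.pi * np.cumsum(f0_audio) / SAMPLE_RATE
    audio = np.zeros_like(phase)
    for k in range(1, n_harmonics + 1):
        below_nyquist = (k * f0_audio) < SAMPLE_RATE / 2  # avoid aliasing
        audio += below_nyquist * np.sin(k * phase) / k
    audio *= amp_audio
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio


# Two notes (C4 then E4), half a second each.
notes = [(60, 100, 0.0, 0.5), (64, 80, 0.5, 1.0)]
f0, loudness = notes_to_controls(notes, duration_s=1.0)
audio = controls_to_audio(f0, loudness)
```

The point of the split is that the two stages only communicate through the control signals, so either one can be retrained or swapped out on its own.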
In terms of actually releasing a MIDI-playable VST plugin, I believe Magenta has something like it in the works [4]. I hope it will come with some way for users to quickly create their own instruments, presumably using a transfer learning technique similar to the one we have presented.
Real-time rendering poses multiple challenges. For one, some instrument sounds occur before the note's onset proper (for example, the sound of fingers pressing the keys of a saxophone comes before the first note of the piece), so a real-time system would have to anticipate the note. Secondly, the research models are quite heavy and considerably more compute-intensive than a standard VST instrument, which is a problem if you want to run one inside a DAW. I think this latter problem can be solved with some clever engineering and the general trend of hardware becoming more and more accommodating to machine learning applications.
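As a rough way to think about the compute problem: a plugin has to render each audio buffer before the next one is due. The sketch below just does that arithmetic. The function names are hypothetical, and the buffer size, sample rate and render time are illustrative numbers, not measurements of any particular model.

```python
def buffer_deadline_ms(buffer_size, sample_rate):
    """Time available to render one audio buffer, in milliseconds."""
    return 1000.0 * buffer_size / sample_rate


def real_time_factor(render_time_ms, buffer_size, sample_rate):
    """Render time divided by the deadline; above 1.0 means audio dropouts."""
    return render_time_ms / buffer_deadline_ms(buffer_size, sample_rate)


# Illustrative numbers only: a 512-sample buffer at 48 kHz, and a model
# that (hypothetically) needs 16 ms to render one buffer.
deadline = buffer_deadline_ms(512, 48000)
factor = real_time_factor(16.0, 512, 48000)
```

With these numbers the deadline is about 10.7 ms, so a model taking 16 ms per buffer would be 1.5x too slow. Larger buffers relax the deadline but add latency, which matters for a playable instrument.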
[1] https://erl-j.github.io/controlsynthesis/#/ (Our previous work)
[2] https://rodrigo-castellon.github.io/midi2params/ (Focuses on real-time rendering)
[3] https://arxiv.org/abs/2112.09312 (Magenta's recent paper on the subject)