Short answer is yes! Previous work has shown that DDSP models can be controlled from MIDI input with very good results. The solutions I am familiar with use a two-stage approach: the first stage turns MIDI into control signals (pitch and loudness contours, etc.), and the second stage turns the control signals into audio (like the particular model I discuss in the blog post) [1][2][3]. I actually think the first stage could also benefit from the transfer learning techniques we discuss in the blog post.
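
To make the two-stage split concrete, here is a rough toy sketch in plain Python. It is not any of the cited models: both stages are hand-written stand-ins for what would be learned networks, and the sample rate, frame rate, note format, and all function names are my own assumptions for illustration.

```python
import math

SAMPLE_RATE = 16000  # audio samples per second (assumed value)
FRAME_RATE = 100     # control frames per second (assumed value)

def midi_to_hz(note):
    """Equal-tempered frequency of a MIDI note number (69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def notes_to_controls(notes, duration):
    """Stage 1 stand-in: notes -> per-frame f0 and loudness contours.

    Each note is (pitch, velocity, start_s, end_s). A learned model would
    predict expressive contours; this just holds flat values per note.
    """
    n_frames = int(round(duration * FRAME_RATE))
    f0 = [0.0] * n_frames
    loudness = [0.0] * n_frames
    for pitch, velocity, start, end in notes:
        first = int(round(start * FRAME_RATE))
        last = min(int(round(end * FRAME_RATE)), n_frames)
        for i in range(first, last):
            f0[i] = midi_to_hz(pitch)
            loudness[i] = velocity / 127.0
    return f0, loudness

def controls_to_audio(f0, loudness, n_harmonics=4):
    """Stage 2 stand-in: additive harmonic synthesis driven by the controls."""
    hop = SAMPLE_RATE // FRAME_RATE  # audio samples per control frame
    audio = []
    phase = 0.0
    for freq, amp in zip(f0, loudness):
        for _ in range(hop):
            phase += 2.0 * math.pi * freq / SAMPLE_RATE
            # Sum harmonics below Nyquist with 1/k amplitude roll-off.
            sample = sum(
                math.sin(k * phase) / k
                for k in range(1, n_harmonics + 1)
                if k * freq < SAMPLE_RATE / 2
            )
            audio.append(amp * sample / n_harmonics)
    return audio

# Two notes (A4 then C5), half a second each.
notes = [(69, 100, 0.0, 0.5), (72, 80, 0.5, 1.0)]
f0, loudness = notes_to_controls(notes, duration=1.0)
audio = controls_to_audio(f0, loudness)
```

The point of the split is that the intermediate contours are interpretable, so either stage can be swapped or fine-tuned on its own.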

As for actually releasing a MIDI-playable VST plugin, I believe Magenta have something like it in the works [4]. I hope it will come with some way for users to quickly create their own instruments, presumably using a transfer learning technique similar to the one we have presented.

Real-time rendering poses multiple challenges. First, some instrument sounds occur before a note's onset (for example, the sound of fingers pressing the keys of a saxophone comes before the first note of the piece). Second, the research models are quite heavy and considerably more compute-intensive than a standard VST instrument, which is a problem if you want to run one inside a DAW. I think this latter problem can be solved with some clever engineering and the general trend of hardware becoming more accommodating to machine learning applications.
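
For a rough feel of both constraints, here is a small back-of-the-envelope sketch. The function names and example numbers are hypothetical, and treating pre-onset sounds as a lookahead delay is my own assumption about how one might handle them, not something taken from the cited work.

```python
def real_time_factor(render_seconds, block_size, sample_rate):
    """Time spent rendering one block divided by the audio time it covers.

    Values below 1.0 mean the model keeps up with real time.
    """
    block_seconds = block_size / sample_rate
    return render_seconds / block_seconds

def lookahead_latency_ms(lookahead_frames, frame_rate):
    """Extra latency if the model must see some control frames ahead
    (e.g. to place key noise before the note onset)."""
    return 1000.0 * lookahead_frames / frame_rate

# Hypothetical numbers: 5 ms to render a 512-sample block at 48 kHz,
# and 10 frames of lookahead at 100 control frames per second.
rtf = real_time_factor(0.005, 512, 48000)
latency = lookahead_latency_ms(10, 100)
```

A model that is comfortably under a real-time factor of 1.0 on your machine still has to share the CPU with everything else in the DAW session, so some headroom is needed.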

[1] https://erl-j.github.io/controlsynthesis/#/ (Our previous work)

[2] https://rodrigo-castellon.github.io/midi2params/ (Focuses on real-time rendering)

[3] https://arxiv.org/abs/2112.09312 (Magenta's recent paper on the subject)