| I've also spent a fair bit of time on this topic and, for what it's worth, I agree with you. It is a harder problem than the monophonic case (and more sensitive to problems like noise under real-world conditions), but you don't strictly need deep learning or AI techniques to solve it. I mean, computational complexity aside, it seems like at least hypothetically you could even just apply basic autocorrelation-style logic to detect the period of the combined wave, much like you do in the monophonic case (assuming the chord is sustained for long enough to actually capture that full period, which of course it won't be in the general case). There's nothing magical about a neural net or other deep-learning-style solution to this problem - at the end of the day that's just an approximation of a formula that could in theory be derived through more direct means anyway. And (as far as I know) there's no reason to believe the polyphonic case is fundamentally resistant to more traditional techniques. And as implied by your comment, the problem is easier (or at least less resource-intensive) in practice than it is in the abstract: we're mostly interested in audio that's composed of actual notes from the chromatic scale (rather than a combination of arbitrary frequencies). There are only 140 or so component frequencies we really need to consider in practice. (Not to mention the semi-predictable repetition/progression patterns you're likely to encounter in most conventional songs. That's inadequate by itself, but it's certainly a good way to error-correct, fill in gaps, resolve ambiguous cases, etc.) But that said, it does seem like polyphonic pitch detection is a problem that responds really well to machine-learning techniques.
In my experience, even a fairly simple ANN (e.g., no hidden layers, ~1k to ~10k weights depending upon how the inputs/outputs are modeled) - when seeded with a little bit of domain-specific knowledge - can very quickly learn to perform reliable polyphonic pitch detection under real-world conditions. To be fair, I haven't quite put my money where my mouth is on this topic (yet): I develop software that includes this sort of functionality, and the current production version uses more conventional (or at least more direct) analysis rather than so-called "deep learning" techniques for polyphonic pitch detection. There are pros and cons to either approach, but I can definitely see why some find the deep learning solution attractive. There's probably some degree of magical thinking involved (i.e., "AI will solve this pattern recognition problem that's too hard for me to work out from first principles"), but it also seems to work really well in this case. For what it's worth, I think you've got the right general idea, or at least (based on your brief description) I think I arrived at a solution that's based on some similar concepts and found it fairly effective (beyond the proof-of-concept phase). And as you noted, there are related concepts discussed in some of the published academic research. I'd love to hear a little more about your approach if you're willing and able to share any more details. (Noting that at least part of my interest in that topic is selfish, of course.) |
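
To make the autocorrelation remark above a bit more concrete, here is a minimal sketch of the hypothetical "detect the period of the combined wave" idea. It is only an illustration under loudly stated assumptions, not the method used in any production software: the function names (`synthesize`, `combined_period_lag`), the 22000 Hz sample rate (chosen so the combined period is a whole number of samples), the search bounds, and the test chord (pure sine tones at 220 Hz and 330 Hz, an exact 3:2 ratio rather than an equal-tempered fifth) are all made up for the example.

```python
import math

def synthesize(freqs_hz, sample_rate, duration_s):
    """Sum of equal-amplitude sine waves, one per frequency (idealized, noise-free)."""
    n = int(round(sample_rate * duration_s))
    return [sum(math.sin(2 * math.pi * f * i / sample_rate) for f in freqs_hz)
            for i in range(n)]

def combined_period_lag(samples, sample_rate, min_hz=60.0, max_hz=1000.0):
    """Return the lag (in samples) with the largest raw autocorrelation.

    Only lags corresponding to periods between 1/max_hz and 1/min_hz are
    searched; both bounds are assumptions made for this sketch.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: sum of products of the signal with
        # a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

SAMPLE_RATE = 22000  # hypothetical rate, picked so the period is an integer lag
chord = synthesize([220.0, 330.0], SAMPLE_RATE, 0.1)  # sustained two-note "chord"
lag = combined_period_lag(chord, SAMPLE_RATE)
print(lag, SAMPLE_RATE / lag)  # period in samples, and the matching frequency in Hz
```

For this idealized input the strongest lag is 200 samples, i.e. 110 Hz, which is the repetition rate of the combined wave rather than either of the two notes. Recovering the individual notes from that would take a further step (for example, checking which chromatic-scale frequencies are consistent with the detected period), and with equal-tempered or noisy real-world audio the combined period is only approximate, which is the caveat already noted above.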