by cnity 1656 days ago
I've worked on this problem for some time on a personal project, and I'm pretty convinced you can basically solve this problem without deep learning or AI techniques, and instead use non-negative matrix factorization[0] as a bank of note templates (from their spectrograms). I have a fairly well working proof of concept and the approach is supported by the literature.

[0]: https://en.wikipedia.org/wiki/Non-negative_matrix_factorizat...

_edit_

That said, you'd probably need something more hard-core for extraction from an actual track, so you're probably right.
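
For readers who want to see the fixed-template idea concretely, here is a minimal sketch. It assumes the note templates are already known and only solves for non-negative activations of one spectrogram frame, using the standard multiplicative update for the Euclidean cost. The function name `note_activations`, the four-bin templates, and the frame are all invented for illustration; this is not the poster's code.

```python
# Toy sketch: the template bank (the W matrix in NMF terms) is held fixed and
# only the non-negative activations h are estimated, so that the weighted sum
# of templates approximates a single frame. All numbers below are made up.

def note_activations(templates, frame, iterations=200, eps=1e-9):
    """Return one non-negative activation per template for a single frame."""
    n_bins = len(frame)
    h = [1.0] * len(templates)
    for _ in range(iterations):
        # Reconstruct the frame from the current activations.
        recon = [sum(a * t[b] for a, t in zip(h, templates)) for b in range(n_bins)]
        # Multiplicative update (Euclidean cost), applied to all notes at once.
        h = [
            a * sum(t[b] * frame[b] for b in range(n_bins))
            / (sum(t[b] * recon[b] for b in range(n_bins)) + eps)
            for a, t in zip(h, templates)
        ]
    return h


# Three hypothetical "notes" over four frequency bins, with overlapping partials.
TEMPLATES = [
    [1.0, 0.0, 0.5, 0.0],  # note A
    [0.0, 1.0, 0.5, 0.0],  # note B
    [0.0, 0.0, 1.0, 0.5],  # note C
]
# A frame that is exactly note A plus note C.
FRAME = [1.0, 0.0, 1.5, 0.5]
```

Calling `note_activations(TEMPLATES, FRAME)` should give activations near 1 for notes A and C and near 0 for note B, even though B shares a bin with both. A real system would need templates learned from recordings and some thresholding of the activations.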


As an amateur musician I see this as a missing holy grail killer app. I'd love to have it just to pull the chords out of some of my own old recordings that I can't figure out how to repeat.

If anyone knows of any apps (even prototypes) that can do this, please provide links.

There are various commercial offerings that do this, including Chordify and Capo.

Alternatively if you want a flexible free (GPLv2) option with source code, consider the Chordino plugin (http://www.isophonics.net/nnls-chroma, C++ code at https://github.com/c4dm/nnls-chroma) which you can run in Sonic Visualiser (https://www.sonicvisualiser.org/), or indeed in Audacity.

There is a reference describing the Chordino method in the page linked above, but roughly it's not too far from the description of the parent poster - a non-negative least-squares method produces a frame-by-frame semitone-scaled decomposition which is then matched against templates and turned into a chord sequence using a hidden Markov model. Some sort of intelligent smoothing like this is certainly needed; the raw template matching is not especially useful on its own.
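
To illustrate why the smoothing step matters, here is a rough sketch of template matching followed by Viterbi decoding. It is not Chordino's actual model: the binary triad templates, the dot-product score, and the flat `switch_penalty` standing in for HMM transition probabilities are all simplifying assumptions, and the function names are made up.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_templates():
    """Binary 12-bin chroma templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            bins = [0.0] * 12
            for interval in intervals:
                bins[(root + interval) % 12] = 1.0
            templates[NOTE_NAMES[root] + suffix] = bins
    return templates


def decode_chords(chroma_frames, switch_penalty=1.0):
    """Viterbi decoding: template score per frame, minus a cost per chord change."""
    templates = chord_templates()
    names = list(templates)

    def frame_scores(frame):
        return [sum(f * t for f, t in zip(frame, templates[n])) for n in names]

    best = frame_scores(chroma_frames[0])  # best path score ending in each chord
    backpointers = []
    for frame in chroma_frames[1:]:
        top = max(range(len(names)), key=best.__getitem__)
        scores = frame_scores(frame)
        new_best, pointers = [], []
        for j in range(len(names)):
            # Either stay on chord j, or switch from the best previous chord.
            if best[j] >= best[top] - switch_penalty:
                pointers.append(j)
                new_best.append(best[j] + scores[j])
            else:
                pointers.append(top)
                new_best.append(best[top] - switch_penalty + scores[j])
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final chord.
    j = max(range(len(names)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    return [names[j] for j in reversed(path)]
```

With `switch_penalty=0.0` this reduces to picking the best template per frame, so a single noisy frame can flip the label; with a positive penalty a brief deviation has to outscore the cost of two chord changes before it shows up in the output.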

This type of method is now routinely outperformed by neural networks (see e.g. these MIREX evaluation results from 2018 which compare Chordino with a few other academic methods https://www.music-ir.org/mirex/wiki/2018:Audio_Chord_Estimat...) but I would suggest that it's still good enough to be useful - and I encourage the parent to continue their work, as it's an interesting area to explore.

Success! Per your suggestion, I used the Chordino plugin with Sonic Visualiser. I pumped in one of my old recordings and it showed me "B6" and "Dmaj7".

However, this was only so helpful. It would have been better if it showed me a fretboard and lit up which notes are active. Instead it just showed me e.g. "B6" which unfortunately has multiple implementations so I had to try many of the implementations across many capo positions, and it ended up being a "B6" that I don't even see documented in any chord guides, probably because I was using a capo. I was eventually able to find it by guess and check: randomly moving my fingers and capo around the fretboard then if it sounds close look it up to make sure it's "B6" in a reverse chord finder (i.e. oolimo.de). Still pretty painful for amateur me.

Perhaps the fact that a given chord has multiple implementations makes it impossible for the analyzer to know which one I'm using, but in my case I'm strumming all 6 strings so I suspect it's doable. Do you know if any tools can show the results as dots-on-fretboard? Or maybe I need a more thorough reverse chord finder?
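
The ambiguity described above can be shown with a small reverse-lookup sketch: it maps a fingering (plus capo position) to the set of note names that sound, and compares that with a chord's notes. Standard tuning, frets counted from the capo, and the helper names are assumptions made for this example; it does not reproduce how any particular chord finder works.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Open-string MIDI note numbers for standard tuning, low E to high E.
STANDARD_TUNING = (40, 45, 50, 55, 59, 64)


def sounding_notes(frets, capo=0, tuning=STANDARD_TUNING):
    """Note names produced by a fingering.

    frets: one entry per string, counted from the capo; None means muted.
    """
    classes = {
        (open_note + capo + fret) % 12
        for open_note, fret in zip(tuning, frets)
        if fret is not None
    }
    return {NOTE_NAMES[c] for c in classes}


def chord_notes(root, intervals):
    """Note names of a chord given its root name and semitone intervals."""
    start = NOTE_NAMES.index(root)
    return {NOTE_NAMES[(start + i) % 12] for i in intervals}


SIXTH = (0, 4, 7, 9)           # major triad plus major sixth
MINOR_SEVENTH = (0, 3, 7, 10)  # minor triad plus minor seventh
```

For example, `sounding_notes((None, 2, 4, 4, 4, 4))`, `sounding_notes((None, 0, 2, 2, 2, 2), capo=2)` and `sounding_notes((7, 9, 9, 8, 9, 7))` all equal `chord_notes("B", SIXTH)`, which is also the same note set as `chord_notes("G#", MINOR_SEVENTH)`. A chord name alone therefore cannot pin down a fingering; an analyzer would need the actual octave of each sounding note to narrow it down.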

Ableton Live has an Audio to MIDI [1] feature that works pretty well to extract notes from audio. Melodyne [2] is a very powerful tool to mess around with notes in general.

There are also some open source projects out there, such as aubio [3] and Omnizart [4].

[1] https://www.ableton.com/en/manual/converting-audio-to-midi/

[2] https://www.celemony.com/en/melodyne/what-is-melodyne

[3] https://aubio.readthedocs.io/en/latest/cli.html#aubionotes

[4] https://music-and-culture-technology-lab.github.io/omnizart-...

Automated music transcription - the process of generating a musical score from an audio recording - is a pretty active topic in signal processing (and has been for a couple of decades at least).

The monophonic case (one note at a time) is fairly well solved at this point: there are decades-old solutions in both the frequency domain (like FFT) and time domain (like auto-correlation and the dozens of refinements of that basic concept) that work quite well under less than ideal conditions and in near real-time. Even a naive solution like just counting the number of zero-crossings in the audio signal to estimate the fundamental frequency works pretty well.
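
As a rough illustration of the two time-domain ideas mentioned above, here is a sketch of a zero-crossing estimator and a basic autocorrelation estimator. The function names, the 50 to 1000 Hz search range, and the 0.9 peak threshold are arbitrary choices for this example, and the code assumes a clean signal longer than the longest lag searched.

```python
import math


def zero_crossing_pitch(samples, sample_rate):
    """Estimate the fundamental by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


def autocorrelation_pitch(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental from the first strong autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    window = len(samples) - max_lag
    acf = [
        sum(samples[i] * samples[i + lag] for i in range(window))
        for lag in range(min_lag, max_lag + 1)
    ]
    threshold = 0.9 * max(acf)
    # Take the first local peak close to the global maximum; the global
    # maximum itself can sit on a multiple of the true period.
    for k in range(1, len(acf) - 1):
        if acf[k] >= threshold and acf[k - 1] <= acf[k] > acf[k + 1]:
            return sample_rate / (min_lag + k)
    return None


def sine(frequency, sample_rate, count):
    """Generate a pure test tone."""
    return [math.sin(2 * math.pi * frequency * n / sample_rate) for n in range(count)]
```

On `sine(220.0, 8000, 4096)` both estimators should land within a few Hz of 220. Zero-crossing counting breaks down quickly once strong harmonics or noise add extra crossings, which is one reason the autocorrelation family has so many refinements.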

The polyphonic case (like chord detection) is trickier, especially depending on what you're looking for exactly. I.e., is it sufficient to say "that's a C Major chord" or are you looking for a specific inversion or even fingering? Does it need to happen in real-time based off a microphone or could you batch-process an audio file instead?

But there are both academic solutions and consumer-oriented tools that can do a reasonable job of it (again, depending on what you're looking for).

If you're looking for guitar-chord detection in particular, I'd recommend you take a look at Chordify (https://chordify.net/). I'm even the developer of a product that competes with (or at least overlaps with) Chordify, but frankly it pretty much does what it says on the tin (extracts chords from audio recordings with more than acceptable fidelity, especially if you're willing and able to refine that by ear using the automated transcription as a starting point).

I'm pretty sure Chordify's solution is based on "deep learning" (ANN) techniques, but others have noted in this thread that's not the only viable way to do it. I suspect some combination of increasing computational power and algorithmic refinements will eventually lead to a "direct analysis" approach that becomes as common/conventional for polyphonic pitch detection as FFT and AC are for the monophonic case. There are already a number of fairly effective techniques depending on the constraints you want to put on the problem.

> The polyphonic case (like chord detection) is trickier

Very true, but for practical purposes chord detection is easier than polyphonic note transcription - it isn't necessary to transcribe all the notes with perfect fidelity to identify a likely chord, and there are many issues around note timing that become simpler when you assume one chord at a time.

> I'm pretty sure Chordify's solution is based on "deep learning" (ANN) techniques

At least at launch, I believe they were using a method more like that of Chordino - in fact using the same chromagram decomposition - but with a more sophisticated language model for chord transitions than Chordino's HMM.

(See this publication from one of Chordify's founders, et al, which I think is relevant, or at least interesting http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.294...)

I may have to blow the dust off of my prototype.

If you do, make sure to make it a Show HN to let us know!

This is the right approach for song recommendation too - try your approach there and see what happens. If you want help on the business side, reach out.

Can you expand on what you mean by "song recommendations" in this context? Do you mean recommendations like "if you like X you might also like Y"?

Assuming the answer is "yes", I'm not sure if I follow the leap from "non-negative matrix factorization as a bank of note templates (from their spectrograms)" to "song recommendations".

Very loosely speaking my (not wholly uninformed) interpretation of the "note templates" bit is sorta analogous to FFT analysis with frequency bins centered around the "regular" notes of the chromatic scale - i.e., the cells of the matrix represent signal strength for the frequencies that correspond to conventional western notes (e.g. integer-valued MIDI note numbers) or (in aggregated form) octave-independent pitch-classes rather than arbitrary frequencies (that might fall between two conventional notes). It's a very useful representation of the component frequencies that appear in conventional music but at the end of the day it's more or less "here are the notes (or pitch classes) active for this given beat".
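
The mapping from arbitrary frequencies to note-centered bins and octave-independent pitch classes can be sketched as follows. It assumes equal temperament with A4 = 440 Hz = MIDI note 69; the function names and the list-of-peaks input format are made up for this example.

```python
import math


def nearest_midi_note(frequency_hz):
    """Integer MIDI note number closest to a frequency (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def fold_to_pitch_classes(peaks):
    """Sum (frequency, magnitude) peaks into 12 octave-independent bins.

    Index 0 is C, index 1 is C#, and so on up to index 11 for B.
    """
    chroma = [0.0] * 12
    for frequency_hz, magnitude in peaks:
        chroma[nearest_midi_note(frequency_hz) % 12] += magnitude
    return chroma
```

For instance, peaks at roughly 261.63 Hz and 523.25 Hz (C4 and C5) both land in bin 0, while 329.63 Hz (E4) lands in bin 4, so the octave information is discarded and only the pitch class survives.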

That is, I can imagine the role this information - essentially the musical score itself, or at least the pitch-specific dimension of that score - could play in a song-recommendation engine, but I'm curious how/why that specific spectrogram-template-based representation is significant.

Are you suggesting that that specific representation could be applied to song recommendations in a way that similar polyphonic-pitch-per-beat information derived from some other algorithm (FFT analysis for a hypothetical example) could not?

Or maybe I've misinterpreted your comment entirely?

More likely, they mean using non-negative matrix factorization but with a bank of feature vectors instead of note templates. NNMF can be used in a wide variety of domains because it essentially encodes the problem of "this thing is a bit like this thing, a bit like this thing, and a bit like this other thing".

If instead of numbers representing intensity at different frequencies (as in the spectrograms), the numbers in each vector of the template bank represent other features (such as listener overlap with other artists/songs, or genre representation across multiple continuous "color" axes) then you can recommend music to a listener based on the similarity of songs in their library to those in the template bank.
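
A very small sketch of the "similarity over feature vectors" part of this idea follows. Note that it swaps in plain cosine similarity rather than an actual matrix factorization, and the feature vectors, song titles, and function names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(library, catalog, top_n=2):
    """Rank catalog songs by their best similarity to any song in the library."""
    scored = [
        (max(cosine(vector, owned) for owned in library.values()), title)
        for title, vector in catalog.items()
        if title not in library
    ]
    return [title for _, title in sorted(scored, reverse=True)[:top_n]]
```

In a factorization-based recommender the feature vectors themselves would be learned from listening data rather than written by hand; the ranking step afterwards looks much like this.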

Ack'd. That makes more sense. I guess I took the comment too literally.

That is very kind of you to offer. I don't know if I'd be able to find the time to dedicate to having this function as an actual product though!

You're already leaps and bounds into a good partnership - realistic expectations. Consider it again when you get downtime, no rush over here, more fun.

I've also spent a fair bit of time on this topic and for what it's worth I agree with you. It is a harder problem than the monophonic case (and more sensitive to problems like noise under real-world conditions) but you don't strictly need deep learning or AI techniques to solve it.

I mean, computational complexity aside it seems like at least hypothetically you could even just apply basic auto-correlation-style logic to detect the period of the combined wave much like you do in the monophonic case (assuming the chord is sustained for long enough to actually capture that full period, which of course it won't be in the general case). There's nothing magical about a neural-net or other deep-learning-style solution to this problem - at the end of the day that's just an approximation of a formula that could in theory be derived through more direct means anyway. And (as far as I know) there's no reason to believe the polyphonic case is fundamentally resistant to more traditional techniques.

And as implied by your comment, the problem is made easier (or at least less resource-intensive) in practice than it is in the abstract: we're mostly interested in audio that's made up of actual notes from the chromatic scale (rather than a combination of arbitrary frequencies). There are only ~140 or so component frequencies we really need to consider in practice. (Not to mention the semi-predictable repetition/progression patterns you're likely to encounter in most conventional songs. That's inadequate by itself but certainly a good way to error correct, fill in gaps, resolve ambiguous cases, etc.)

But that said, it does seem like polyphonic pitch detection is a problem that responds really well to machine-learning techniques. In my experience, even a fairly simplistic ANN (e.g., no hidden layers, ~1k to ~10k weights depending upon how the inputs/outputs are modeled) - when seeded with a little bit of domain-specific knowledge - can very quickly learn to perform reliable polyphonic pitch detection under real-world conditions.
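
To make "no hidden layers" concrete, here is a toy sketch of such a network: one sigmoid output per note, connected directly to the spectral inputs and trained by full-batch gradient descent on cross-entropy. It is far smaller than the ~1k to ~10k weights mentioned above, uses invented four-bin note templates as training data, and is not the commenter's model.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(inputs, targets, epochs=3000, rate=0.5):
    """Full-batch gradient descent on one sigmoid output per note, no hidden layer."""
    n_in, n_out, n = len(inputs[0]), len(targets[0]), len(inputs)
    weights = [[0.0] * n_in for _ in range(n_out)]
    biases = [0.0] * n_out
    for _ in range(epochs):
        for k in range(n_out):
            grad_w, grad_b = [0.0] * n_in, 0.0
            for x, y in zip(inputs, targets):
                z = biases[k] + sum(w * v for w, v in zip(weights[k], x))
                error = sigmoid(z) - y[k]  # cross-entropy gradient w.r.t. z
                grad_b += error
                for i, v in enumerate(x):
                    grad_w[i] += error * v
            biases[k] -= rate * grad_b / n
            weights[k] = [w - rate * g / n for w, g in zip(weights[k], grad_w)]
    return weights, biases


def detect(weights, biases, x):
    """Return one 0/1 decision per note for a single input frame."""
    return tuple(
        int(sigmoid(b + sum(w * v for w, v in zip(ws, x))) > 0.5)
        for ws, b in zip(weights, biases)
    )


# Invented training set: every on/off combination of three toy note templates.
TEMPLATES = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.5, 0.0], [0.0, 0.0, 1.0, 0.5]]
TARGETS = list(product((0, 1), repeat=3))
INPUTS = [
    [sum(on * t[b] for on, t in zip(combo, TEMPLATES)) for b in range(4)]
    for combo in TARGETS
]
```

After `weights, biases = train(INPUTS, TARGETS)`, calling `detect(weights, biases, x)` on each training frame should recover which notes were switched on. Real audio adds noise, varying timbre and loudness, which is where the training data and any domain-specific seeding start to matter.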

To be fair, I haven't quite put my money where my mouth is on this topic (yet): I develop software that includes this sort of functionality and the current production version uses more conventional (or at least direct) analysis rather than so-called "deep learning" techniques for polyphonic pitch detection. There are pros and cons to either approach, but I can definitely see why some find the deep learning solution attractive. There's probably some degree of magical thinking involved (i.e., "AI will solve this pattern recognition problem that's too hard for me to work out from first principles"), but it also seems to work really well in this case.

For what it's worth I think you've got the right general idea, or at least (based on your brief description) I think I arrived at a solution that's based on some similar concepts and found it fairly effective (beyond the proof-of-concept phase). And as you noted there are related concepts discussed in some of the published academic research. I'd love to hear a little more about your approach if you're willing and able to share any more details. (Noting that at least part of my interest in that topic is selfish, of course.)