by CornCobs 1740 days ago
I'm working in a similar domain, music transcription. The challenge is to estimate note values (how many beats is a note supposed to last, as represented in the score?), and I'm not sure what would be a good way to measure transcription accuracy. The naive note error rate cannot capture whether my model successfully detects certain musical structures: syncopation, dotted rhythms, etc.
4 comments

Related, are there better representations for music than standard notation (or MIDI)?

I'm wondering what the higher convolution levels could look like, if this were a CNN analyzing an image. Something between the complete Ableton/Logic export and a MIDI file. Being able to capture the "feel" of a song (or a section within a song) strikes me as an important milestone towards designing really good generative music.

Maybe some kind of alignment metric, to measure how far off on timing notes tend to be?
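
To make that concrete, here is a rough sketch of one possible alignment metric. It assumes onsets are given in beats, pairs each reference onset with the nearest unused transcribed onset, and ignores pairs further apart than a tolerance. The function names, the greedy matching, and the 0.5-beat tolerance are all my own assumptions, not anything established.

```python
from statistics import mean


def onset_offsets(reference, estimated, tolerance=0.5):
    """Greedily pair each reference onset with the nearest unused estimate.

    Returns the signed offsets (estimated - reference) for matched pairs.
    Onsets are in beats; `tolerance` is an assumed cutoff, also in beats.
    """
    remaining = sorted(estimated)
    offsets = []
    for ref in sorted(reference):
        if not remaining:
            break
        nearest = min(remaining, key=lambda est: abs(est - ref))
        if abs(nearest - ref) <= tolerance:
            offsets.append(nearest - ref)
            remaining.remove(nearest)  # each estimate is used at most once
    return offsets


def alignment_summary(reference, estimated, tolerance=0.5):
    """Summarize how far off matched onsets tend to be."""
    offsets = onset_offsets(reference, estimated, tolerance)
    if not offsets:
        return {"matched": 0, "mean_offset": None, "mean_abs_offset": None}
    return {
        "matched": len(offsets),
        # Signed mean: positive means the transcription tends to be late.
        "mean_offset": mean(offsets),
        # Absolute mean: typical size of the timing error.
        "mean_abs_offset": mean(abs(o) for o in offsets),
    }
```

A signed mean near zero with a large absolute mean would suggest jitter rather than a consistent early or late bias. Greedy matching is the simplest choice here; an optimal assignment might behave better on dense passages.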

I can also imagine a generalized "local error rate" which measures how far away errors tend to be from each other. If errors tend to be clustered, I would guess that's showing inability to follow some musical pattern. I think you'd want errors to appear randomly distributed rather than clustered. (This metric might make sense for speech too)
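
One way this could be sketched, as an assumption about what "local error rate" might mean: take a per-note list of error flags, count how often two neighboring notes are both wrong, and divide by the count you would expect if the same number of errors were scattered uniformly at random. The names `adjacent_error_pairs` and `clustering_ratio` are made up for illustration.

```python
def adjacent_error_pairs(errors):
    """Count positions where a note and the next note are both errors."""
    return sum(1 for a, b in zip(errors, errors[1:]) if a and b)


def clustering_ratio(errors):
    """Observed adjacent error pairs divided by the expected count.

    With k errors placed uniformly at random among n notes, the expected
    number of adjacent error pairs is k * (k - 1) / n. A ratio well above 1
    hints at clustering; a ratio near or below 1 hints at scattered errors.
    Returns None when there are fewer than two errors.
    """
    n = len(errors)
    k = sum(1 for e in errors if e)
    if k < 2:
        return None
    expected = k * (k - 1) / n
    return adjacent_error_pairs(errors) / expected
```

For example, `clustering_ratio([True, True, True, True, False, False, False, False])` comes out at 2.0, while the same four errors alternating with correct notes gives 0.0. This only looks at immediate neighbors; a version based on gaps between errors would capture looser clusters.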

Hey, you should drop me an email (info in my HN bio). This is a passion of mine and I'm always up for chatting about it.
Sure thing!
You might want to consider comparing generated sound files, rather than abstract notation. If you have the ground-truth notation, render it using the same mechanism as your transcription. Then you can use various spectral comparison techniques on the sound, including things like Fourier analysis, to compare structure.
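
A minimal sketch of the spectral comparison idea, under heavy assumptions: plain sine tones stand in for the rendered audio, a naive DFT stands in for a proper FFT library, and cosine similarity of magnitude spectra is just one of many possible comparisons. All function names here are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def spectral_similarity(rendered_truth, rendered_estimate):
    """Compare two equal-length signals by their magnitude spectra."""
    return cosine_similarity(
        magnitude_spectrum(rendered_truth),
        magnitude_spectrum(rendered_estimate),
    )


def sine(freq, sample_rate=800, n=200):
    """Toy stand-in for a rendered note: a pure sine tone."""
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]
```

Comparing `sine(100)` with itself gives a similarity very close to 1, while `sine(100)` against `sine(200)` gives a value very close to 0. Note that a whole-signal spectrum discards timing, so for rhythm errors you would probably want to compare short windows (a spectrogram) rather than one spectrum per file.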