When you get used to this kind of high-quality metadata, it's just so, so sad to see how companies like Spotify treat metadata. As an example, look up Bob Marley & The Wailers on Spotify and try to find original releases, and then compare that to the list found here:
The amount of data is amazing – but I find that it's both a blessing and a curse. It absolutely excels at the use case of tagging audio files (as a lot of people here are noting), and as an encyclopedic reference (its purpose). For other software integration use cases, where there is any ambiguity involved whatsoever, a huge portion of code needs to be dedicated to deciding which recording/release/etc. is the likely intended or "canonical" entity. I find that the rankings from the search API are not nearly good enough for this.
Consider a recording search for "smells like teen spirit". Any human with cursory knowledge of pop music would point you at the Nirvana single from the 1991 release of Nevermind (in this particular case, it's likely even true in every locale). But MusicBrainz has no notion of popularity, common sense, or the real-world context of any of its entities, so the recording from Nevermind isn't even on the first page of results. Heck, the first result isn't even a Nirvana recording. The second result is from an obscure live bootleg album. In my opinion this should be considered a bug. This stuff matters!
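For anyone who wants to reproduce a search like this, here is a minimal sketch of building the request against the public MusicBrainz web service. The helper name and the default limit are just choices for the example, and it only constructs the URL; actually fetching it needs an HTTP client and a descriptive User-Agent header.

```python
from urllib.parse import urlencode

# Base URL of the MusicBrainz web service (version 2) recording search.
MB_RECORDING_SEARCH = "https://musicbrainz.org/ws/2/recording"


def recording_search_url(title: str, limit: int = 25) -> str:
    """Build a recording search URL for an exact-phrase title query.

    `recording_search_url` is a hypothetical helper, not part of any library.
    The query uses Lucene field syntax, which the search endpoint accepts.
    """
    params = {
        "query": f'recording:"{title}"',
        "fmt": "json",   # ask for JSON instead of the default XML
        "limit": limit,  # number of results per page
    }
    return f"{MB_RECORDING_SEARCH}?{urlencode(params)}"


url = recording_search_url("smells like teen spirit")
print(url)
```

The results come back in the server's own relevance order, which is the ordering being complained about here.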
This is an area I've dedicated a lot of time to when integrating MusicBrainz with my project, and it strikes me as something that MetaBrainz could spend time on to make the platform more accessible. Answering simple questions about music is currently quite difficult for newcomers on account of the overwhelming amount of data. Consider a world where it's possible to stream every recording from every release in the MusicBrainz database: it should be easier to make "Alexa, play Dark Side of the Moon" work without it needing to ask whether I mean the 1994 Netherlands CD release.
(FWIW, it's totally possible to build these heuristics on top of MusicBrainz today, but having better built-in support for determining this stuff would be nice. Spotify is absolutely amazing at figuring out what song in its entire catalog should be the top result even when I've only typed a few characters.)
"Canonical" can't apply to music releases in an objective and definitive way.
Context certainly matters (first, modified, compilation, remaster, remix, audiophile pressing, and so on) but you can't even nail "canonical" to first release, especially for singles, because there may be early promo mixes, radio mixes, vinyl mixes, iTunes mixes, and so on - all mastered differently.
Most people's idea of "canonical" is really "The version I want to hear without having to specify other details". But that's subjective and likely to be significantly different for some non-trivial percentage of users, especially in different territories.
Spotify probably just makes an informed stab at "most popular" - which is a good heuristic and will work most of the time, but is hard to calculate when you don't have Spotify's stats.
We may never have the stats Spotify has, but we are trying to get listening information via the in-development ListenBrainz: https://listenbrainz.org
I'm not sure when/if we'll be able to tie it in with MusicBrainz directly, but for someone like exogen, ListenBrainz may be a good basis to figure out relative popularity of various Recordings/Tracks regardless.
You've described the issue pretty well, and I understand (and agree with!) all of that – like I said, I've devoted a LOT of time to solving this.
> Most people's idea of "canonical" is really "The version I want to hear without having to specify other details". But that's subjective and likely to be significantly different for some non-trivial percentage of users, especially in different territories.
Yup! You are describing the problem literally any search engine faces. And yet, Google/Bing/etc. provide pretty smart results. So, do you think the "Smells Like Teen Spirit" recording by Francis Drake is the BEST first result, as MusicBrainz says it is? Is a live bootleg recording the BEST second result? In any locale? MusicBrainz is NOT primarily a search engine, but all that data has very little value if people (and other software) can't actually find it! This absolutely harms adoption.
OK, so we might not need to nail down a "canonical" version when we live in a world with search ranking scores. I totally realize "canonical" is a bad word choice on my part – but it's really how people think of these things!
> Spotify probably just makes an informed stab at "most popular" - which is a good heuristic and will work most of the time, but is hard to calculate when you don't have Spotify's stats.
I bet they do it that way too, but I think you're throwing in the towel way too early here. :) I have a system that works amazingly well and nearly always chooses the most likely intended recording without any listen count data. MusicBrainz has a LOT of data available to it, so what type of heuristics might make sense here? I use a ranking system that takes all these factors into account and, like Lucene, assigns a score:
• Number of releases & release groups the recording appears on (the most well-known recording is more likely to appear on additional albums like compilations, and more likely to be widely released in lots of countries).
• How old the release is relative to the other search results (earlier matches are more likely to be the original).
• Whether the recording is from a release with a "single from" relation to another album (the target LP is more likely to hold the recording we want).
• Whether it's from a release that's an Album or EP (positive weighting), or Live (negative weighting), whether the recording ONLY appears on Compilation albums (negative weighting), whether it's any other type of release like Bootleg (strong negative weighting).
• Whether the recording has ISRCs entered for it (more well-known recordings are more likely to have ISRCs in the first place, and also more likely for people to have entered them into MusicBrainz).
• Whether MusicBrainz users have entered any tags and ratings for it (weak but positive correlation with how popular it is).
• Domain-specific string similarity metrics; essentially, query expansion that makes sense specifically for song titles & artist names. This lets certain matches remain equivalent when it makes sense (e.g. "mambo number 5", "mambo no. 5", "mambo #5", "mambo number five" should all be exactly equivalent in terms of string matching). Lucene does some of this already of course, but not nearly enough: I have a query expander with hundreds of examples where Lucene does a worse job.
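To make the list above concrete, here is a rough sketch of how factors like these could be folded into one score. It is not my actual system: every field name, weight, and the tiny number-word table are made up for illustration, and the two candidates at the bottom are invented test data, not real MusicBrainz records.

```python
import re
from dataclasses import dataclass

# Tiny illustrative table; a real expander would cover far more cases.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
_NUMBER_RE = re.compile(
    r"(?:\bnumber\b|\bno\b\.?|#)\s*(\d+|" + "|".join(NUMBER_WORDS) + r")\b"
)


def normalize_title(title: str) -> str:
    """Map 'number 5', 'no. 5', '#5', 'number five' to the same form ('5').

    Caveat: a phrase like 'no one' is also rewritten; acceptable for a sketch
    because query and candidate titles go through the same function.
    """
    text = title.lower()
    text = _NUMBER_RE.sub(lambda m: NUMBER_WORDS.get(m.group(1), m.group(1)), text)
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class Candidate:
    """Hypothetical, pre-extracted features for one recording search result."""
    title: str
    release_count: int                    # releases the recording appears on
    first_release_year: int
    release_types: frozenset = frozenset()  # e.g. {"Album"}, {"Live"}
    is_bootleg: bool = False
    is_single_from_album: bool = False    # has a "single from" relation
    has_isrc: bool = False
    tag_count: int = 0


def score(c: Candidate, query: str, earliest_year: int) -> float:
    """Combine the heuristics into one number; all weights are arbitrary guesses."""
    s = 0.0
    if normalize_title(c.title) == normalize_title(query):
        s += 5.0
    s += min(c.release_count, 20) * 0.25                       # widely released
    s -= min(c.first_release_year - earliest_year, 20) * 0.1   # later than the earliest
    if c.is_single_from_album:
        s += 1.0
    if c.release_types & {"Album", "EP"}:
        s += 1.0
    if "Live" in c.release_types:
        s -= 1.0
    if c.release_types == {"Compilation"}:                     # compilation-only
        s -= 1.0
    if c.is_bootleg:
        s -= 3.0
    if c.has_isrc:
        s += 0.5
    if c.tag_count > 0:
        s += 0.25
    return s


def rank(candidates: list, query: str) -> list:
    """Sort candidates from most to least likely intended recording."""
    earliest = min(c.first_release_year for c in candidates)
    return sorted(candidates, key=lambda c: score(c, query, earliest), reverse=True)


# Invented example data, only to exercise the function.
studio = Candidate("Smells Like Teen Spirit", 40, 1991, frozenset({"Album"}),
                   is_single_from_album=True, has_isrc=True, tag_count=3)
bootleg = Candidate("Smells Like Teen Spirit", 1, 1992, frozenset({"Live"}),
                    is_bootleg=True)
ranked = rank([bootleg, studio], "smells like teen spirit")
print([c.release_types for c in ranked])
```

With weights anywhere in this ballpark the studio recording comes out well ahead of the bootleg; the interesting work is tuning them against real queries.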
I can think of more too that my system doesn't currently use. All that's without relying on any external data source! But if you want to go one better, it's also possible to correlate results with other APIs like Wikidata, DBpedia, Spotify, YouTube…
In most cases, I've found that there's enough of a delta between the top score and the second-best score to determine which one is "correct". (Yes, that word, I know…)
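A minimal sketch of that "is the gap big enough?" check, assuming you already have one numeric score per result. The function name and the default margin are arbitrary choices for the example.

```python
from typing import Optional, Sequence


def pick_confident(scores: Sequence[float], min_margin: float = 1.0) -> Optional[int]:
    """Return the index of the best score if it beats the runner-up by at
    least `min_margin`; return None when the winner is ambiguous.

    `min_margin` is an illustrative threshold, not a recommended value.
    """
    if not scores:
        return None
    # Indices sorted by score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if len(order) == 1:
        return order[0]
    best, runner_up = order[0], order[1]
    if scores[best] - scores[runner_up] >= min_margin:
        return best
    return None


print(pick_confident([2.0, 9.5, 3.0]))  # clear winner at index 1
print(pick_confident([3.0, 2.9]))       # too close to call
```

When it returns None, the caller can fall back to asking the user or to an external signal rather than guessing.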
Ideally MusicBrainz would be on par with a human expert in determining which recording you most likely meant, and I believe that it CAN do this today, but it doesn't.
Note that our current search server software is in "minimal maintenance" mode. We're working on a replacement which will hopefully allow for a lot of improvements to search rankings etc., but a lot of other things have higher priority (like actually being able to serve requests in spite of getting hammered by bots and spammers).
Of course, MusicBrainz is an open source endeavour. The old search server maintainer was a volunteer from the community. If you believe you can do a better job at running our search server, please join us in #metabrainz at Freenode and introduce yourself.
Also, note: in theory MusicBrainz already has metrics for the number of clicks, views, lookups, and edits certain entities get through their site and API. I bet these are strongly correlated with listens/popularity.
What does "in theory" mean here? Do those tables exist in whole or some part? Is this a matter of indexing an existing data set or hoping some data was acquired by accidental consequence?
Even if it's not collected though, it's data that they at least already have the ability to collect by simply flipping a switch, as opposed to spinning up a whole new ListenBrainz service and hoping it gains traction.
I've used it daily for a couple of years now (to clean up the tags in my collection - quodlibet integrates nicely with musicbrainz for that) - and I very rarely play something that's not there.
When this is the case I try to add/edit the metadata, but most of the time, they're way ahead of me.