by chapel 5518 days ago
He addresses this in his article, but you really can't evaluate competitors to your own service in such a manner. Not only were the testing criteria highly subjective, but the negative marks were simply the songs he personally considered 'WTF' in the lists.

I know music is a very subjective subject, but I really think he could have been more objective about the whole thing, maybe by using other people's music collections and getting their own personal opinions on which songs work and which songs don't. Add to that, he could have had other people rate the playlists he generated, so it wasn't just his opinion.

Also, wtf about Genius only getting 10 marks against it for not doing Beatles playlists? I'm sorry, but if your criterion is that any out-of-place song counts as a negative, then a playlist with no songs should be worth 24 'WTF' points. Not that it matters in the comparison, which was nowhere near close.

I don't have access to the beta, and at this point don't really care about it, but this just screams self-promotion. I think it would have been a lot more respectable if he had been more objective concerning the tests he used.


Paul has done a lot of work in the field of playlist evaluation.

I think that if you survey the researchers in the field, the "WTF test" would be considered fairly reasonable, especially for a quick-and-dirty evaluation. Can you point to any specific songs that he said were WTFs and you think aren't, or vice versa? If not, then it would appear to meet the objectivity criteria.

Using his own music collection might be slightly more suspect. Changing that might have flipped the outcome of iTunes vs EchoNest, but wouldn't have changed the real news here: Google does really, really badly.

I think you can argue the individual songs but the overall findings are sound - Google is poor, EchoNest is good, iTunes is good bar one song (which is suspicious). The methodology is a bit finger in the air but then so are the findings.

It is impossible to be objective about music. The point of his "WTF" test (not speaking for Paul, but I do work with the guy) was to look at each song by each provider and ask "would most people agree this doesn't belong?", erring very far on the generous, inclusive side. You can definitely quibble about whether or not a particular song deserves a WTF mark, but you can't dispute that Google's results are terrible overall.

You couldn't do a test with other collections because the beta is very limited right now; I can only think of a few people I know who have access. But I can concur that my results are as terrible as his were.

Yes, he works for EN (which he's very clear about) and yes it's a bit inside baseball showing a service that most people can't use (because they're not developers or customers) but you can scroll past our results & take it as a post titled "How is Apple so much better than Google at such a data driven task?"

(I would have given the lack of Beatles on Genius -24 too! I also found a couple of EN clunkers that I'd give a WTF to, but I'm a notorious jerk for those things, ask anyone I work with :)

> you can scroll past our results & take it as a post titled "How is Apple so much better than Google at such a data driven task?"

Sure, but that wouldn't exactly be fair seeing as Genius is 3+ years mature, and Google Music+Instant Mix is in beta and less than a week old. OP even seems to think so...

> The last time I took a close look at iTunes Genius was 3 years ago. It was generating pretty poor recommendations.

Fair, but a few points:

- Genius was much better than this when it launched. It's even better now, but it wasn't as bad back then as Instant Mix is today.

- You don't think Google has more data about music than Apple did when it launched Genius? YouTube, Music Onebox, search traffic.

- (severe bias alert) EN's playlist APIs are the same excellent quality today as the day they launched (Sept. '10). We're roughly 0.10% of Google's size. We didn't need any warm-up period.

I'll absolutely agree that Google Music Instant Mix is nowhere near where it needs to be, not even close, but I still can't shake the feeling that this review is more of a "Please Buy Us" post...

Just curious, in regard to the quality of EN since launching: you guys were clearly working on your algorithm for several years prior to launch[1] in late 2010. That's sort of a warm-up period, no?

[1] second last paragraph of http://blogs.oracle.com/plamere/entry/genius_or_savant_syndr... from 2008

You could call it a warm-up period, but the point is that period is pre-launch. I think it is fair to assume a team at Google worked on this project before launch as well. I don't know for how long, but they could have chosen to take the time to develop a better algorithm prior to launch too. (Note: Google's historical use of the Beta tag has made it effectively meaningless IMO.) For whatever reason they chose not to, and it's kind of surprising how poor a product they launched with.

Why do they think it is so important to get in this market ASAP? Perhaps under Page there is additional pressure to launch fast and early a la start-ups?

With that said, I have no idea if the quality of EN is as good as he claims.

It's quite a warm-up period.. Founded in 2005, first API went public in March 2008.

http://techcrunch.com/2008/03/27/first-machine-listening-api...

> I still can't shake the feeling that this review is more of a "Please Buy Us" post...

..so? I'd honestly like to see more of this sort of advertising.

I agree music is subjective and his results are biased. However, it's safe to say Google's attempts are terrible.

His criteria are subjective, yes, but isn't that the point? The point of music recommendation engines isn't to figure out the absolute "best" playlist based on a starter song. It's to figure out the best playlist for the individual user. If the user is asking "WTF?" about the songs on his list, then by definition, the engine has failed that user.

While the author could have been more objective about his criteria, ironically enough, I think he missed the more salient point that he raised by implication: that music engines should be mapping the user's behavior patterns vis-a-vis the songs in his collection, and not so much objective connections between songs. This is what Genius does and has always tried to do, and it's why Genius seems to work better for the author. Genius focuses less on forging objective links between songs than on drawing links between the behavior patterns of like-minded users w/r/t songs.

When listening to songs in a collection, our brain maps out its own connections between songs, as reflected in the way we compile our own lists consciously or subconsciously. Sometimes those connections make objective sense (e.g., "I want to listen to '70s funk, so I'm going to pick ten '70s funk songs in a row."). Sometimes those connections make little objective sense (e.g., "I am listening to a track by Lady Gaga, and afterward, I feel like listening to a track by J.S. Bach."). A good mixing engine figures out the idiosyncrasies and subjectivity of our brains, as reflected statistically by the choices we've made in the past.
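
To make the "links in behavior patterns" idea concrete, here is a minimal sketch of one way it could work: count how often two songs turn up in the same listening session, then rank candidates by that count. This is only an illustration of the general co-occurrence idea, not a description of how Genius or any other service actually works. The function names, the session format, and the song labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often each pair of songs appears in the same session.

    `sessions` is an assumed input format: a list of listening sessions,
    each a list of song labels.
    """
    counts = Counter()
    for session in sessions:
        # Deduplicate and sort so each pair is stored once as (a, b) with a < b.
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def similar_songs(seed, counts, top_n=3):
    """Rank songs by how often they shared a session with the seed song."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == seed:
            scores[b] += n
        elif b == seed:
            scores[a] += n
    return [song for song, _ in scores.most_common(top_n)]


# Hypothetical listening histories.
sessions = [
    ["gaga", "bach"],
    ["gaga", "bach", "funk"],
    ["funk", "disco"],
]
counts = build_cooccurrence(sessions)
ranked = similar_songs("gaga", counts)
```

With these made-up sessions, "bach" ranks above "funk" for the seed "gaga" simply because listeners paired them more often, whether or not the two sound alike.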

Hi Chapel - you suggest I could have been more objective, perhaps using other people's music collections and getting their own personal opinions. That, of course, would still be a subjective evaluation. It would be better, of course: more opinions mean more data. In fact, I welcome people to make the same evaluation with their own collections. Enroll them in all 3 systems, create some playlists and evaluate them. Since most people don't have access to Google Music yet, this is hard to do. Still, you can look at the playlists that I generated and form your own WTF opinions about them. Or better yet, count the WTFs in the playlist Google created during the Google I/O keynote. You can see it at 28:29 of this video: http://www.youtube.com/watch?v=OxzucwjFEEs Here's a screencap: https://skitch.com/plamere/r9x2k/youtube-google-i-o-2011-key...

There's no objective evaluation of playlists. I've proposed a simple, subjective one that I think gets the job done. I'm happy to try other ones if you have something to suggest.

Nitpick: an empty playlist has no songs out of place, so it should score 0 WTFs (I don't understand how you got length(empty)=24). Giving Apple some WTFs instead of giving it 0 is clearly more in the spirit of the test than following the criterion to the letter would have been.

While that might make sense under a strictly logical definition of what a "WTF" is in this context, assigning 0 WTFs for the inability to generate a playlist is useless for the purposes of this comparison.

From my perspective, I'd say an empty playlist is worth 24 WTFs, in the sense of "WTF, I expected 24 similar songs and got 0!"

I disagree.

Would you rather be given no results or bad results? I agree that getting no results isn't good, but I'd rather have an application that actually recognized it couldn't do a good job than one that just threw a load of junk at me.

I think a score of 50% or so is probably about fair. It shouldn't get a good score, but I'd certainly rate it higher than one which just produced nonsense.

The playlists have 25 songs each, one of which is the original, so that's up to 24 opportunities to make a mistake.
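
For anyone who wants to compare the scoring options argued over in this thread, here is a minimal sketch of the arithmetic. It assumes the numbers given above (25 songs per playlist, one of them the seed, so at most 24 WTFs) and treats the penalty for a missing playlist as a tunable fraction: 0.0 for the "empty means zero WTFs" reading, 1.0 for the full 24, and 0.5 for the "50% or so" suggestion. The function and parameter names are hypothetical and are not taken from the original article.

```python
from typing import Optional, Sequence

PLAYLIST_SIZE = 25            # songs per playlist, including the seed song
MAX_WTFS = PLAYLIST_SIZE - 1  # 24 opportunities to make a mistake


def wtf_score(wtf_flags: Optional[Sequence[bool]],
              empty_penalty: float = 1.0) -> float:
    """Return the WTF score for one playlist (lower is better).

    wtf_flags: one boolean per non-seed song, True meaning "WTF";
        None or an empty sequence means no playlist was generated.
    empty_penalty: assumed fraction of MAX_WTFS charged when no
        playlist was generated.
    """
    if not wtf_flags:
        return empty_penalty * MAX_WTFS
    return float(sum(wtf_flags))


# A playlist with two out-of-place songs among the 24 non-seed tracks.
two_clunkers = [True, False, True] + [False] * 21

score_normal = wtf_score(two_clunkers)             # counts the True flags
score_full = wtf_score(None)                       # full penalty
score_zero = wtf_score(None, empty_penalty=0.0)    # literal reading
score_half = wtf_score(None, empty_penalty=0.5)    # the 50% suggestion
```

Under these assumptions the three positions in the thread differ only in the value of `empty_penalty`, which makes it easy to check whether the overall ranking changes with the choice.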