Hacker News
by CorneliaKara 3529 days ago
[disclaimer: I work on the Computer Vision API] Good examples! It looks like you were using the Image Captioning operation in the Computer Vision API. I would expect that, for the cases where the output was not correct, the API returned a low confidence score. It really depends on your scenario, but in my own testing a caption with a confidence score below 40% is likely to contain incorrect info. Now, to explain what's going on a bit better: the vision models behind the API were trained on a large body of images, and you can imagine that coats of arms or images of astronauts weren't very prevalent (while images of giraffes or motorcycles were). We continue to improve the vision models over time, so seeing feedback like this on HN (or StackOverflow, or the User Voice forum on Microsoft.com/cognitive) helps!

Thanks for the response Cornelia.

I can see that the Computer Vision API does return some useful information. E.g. it appears to discriminate well between abstract images and photos. I appreciate the inclusion of scores with the returned information.

However, the captioning reliably produces odd results. I Googled "Italian guy eating pizza" to fit the "person verbing a common noun" model. This was the first non-cartoon image for me:

https://s-media-cache-ak0.pinimg.com/564x/68/c6/cf/68c6cf87b...

And the caption:

{ "type": 0, "captions": [ { "text": "a man and a woman eating a plate of food", "confidence": 0.44831967045071774 } ] }

The woman in question is, I presume, the small statue of the Virgin Mary stood next to the pizza.
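
For reference, here is a minimal sketch of how the "below 40%" rule of thumb from the parent comment would apply to this response. The JSON is the caption result pasted above; the function name `split_captions` and the constant `MIN_CONFIDENCE` are hypothetical names for illustration, not part of the API, and the 0.40 cutoff is only that commenter's informal guideline.

```python
import json

# Response copied from the caption result quoted above.
RESPONSE = (
    '{ "type": 0, "captions": [ { "text": "a man and a woman eating a plate '
    'of food", "confidence": 0.44831967045071774 } ] }'
)

# Assumed cutoff, taken from the informal "<40%" guideline upthread.
MIN_CONFIDENCE = 0.40


def split_captions(response_text, min_confidence=MIN_CONFIDENCE):
    """Split captions into (kept, rejected) lists using a confidence cutoff."""
    captions = json.loads(response_text).get("captions", [])
    kept = [c for c in captions if c.get("confidence", 0.0) >= min_confidence]
    rejected = [c for c in captions if c.get("confidence", 0.0) < min_confidence]
    return kept, rejected


kept, rejected = split_captions(RESPONSE)
for caption in kept:
    print("kept:", caption["text"], round(caption["confidence"], 2))
for caption in rejected:
    print("rejected:", caption["text"], round(caption["confidence"], 2))
```

This caption scores roughly 0.45, so it clears the 0.40 cutoff and would be kept, even though it describes a woman who isn't there.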

There were also a few things I thought would fail but didn't, e.g. distinguishing preparing food from eating it. This was nice.