Hacker News
by bernardopires 3557 days ago
Just a nit, but the author keeps talking about object recognition while what he was actually doing is image classification. Object recognition actually consists of two tasks: one is classifying the object (this is a beer bottle) and the other is saying where in the image the object is. Additionally, it can/should detect multiple objects in the image. This is more complex than classification, which only associates one category with the image.
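To make the distinction concrete, here is a rough sketch of what the two kinds of output might look like for the same photo. The class names, fields, and every value below are made up for illustration and do not come from the article or from any particular library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Classification:
    """Image classification: one category for the whole image."""
    label: str
    score: float


@dataclass
class Detection:
    """One detected object: a category plus where it is in the image."""
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical results; all labels, scores, and boxes are invented.
classification = Classification(label="beer bottle", score=0.91)

detections: List[Detection] = [
    Detection(label="beer bottle", score=0.88, box=(40, 30, 120, 260)),
    Detection(label="beer bottle", score=0.79, box=(150, 35, 230, 255)),
    Detection(label="glass", score=0.64, box=(260, 90, 330, 250)),
]

print(classification.label)
print([(d.label, d.box) for d in detections])
```

The classifier says one thing about the whole image; the detector returns a list, and each entry carries its own location.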
3 comments

I don't think there's a consistent terminology. In my computer vision class we called it "object recognition" when it was about recognizing one specific object (this particular car) and "object classification" when deciding the category of the object in the image (in general, like 'car', 'bottle').

One may also call localizing the object, followed by classifying it, "object detection".

But I don't think it matters much what we call it, as long as we understand what the task is.

http://pjreddie.com/darknet/yolo/, https://github.com/daijifeng001/MNC, https://bitbucket.org/aquariusjay/deeplab-public-ver2 or similar should do the job. Choose depending on how fast it needs to be and how accurate the segmentation boundaries need to be.
Actually, the TensorFlow implementation he uses does both segmentation and classification and returns a probabilistic graph of objects. For his application, it's only returning the top result, so it looks more basic than it is.
No, it doesn't, and no graph is returned at all. It's just a list of the top classification labels for the image (see the example in the tutorial he cited: https://github.com/tensorflow/tensorflow/tree/master/tensorf...). This is not the result of a segmentation; it is a list of the labels the model considers most likely for the image. If you look at the top results, you'll see they're usually similar or in the same family (again, refer to the example in the linked tutorial, where the top 3 labels are military uniform, suit, and academic gown). This is literally the normalized output of the nodes of the last layer in the neural network (where each node corresponds to one category). If you added all the probabilities together, they'd sum to 1.
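As a minimal sketch of what "normalized output of the last layer" means, here is a softmax plus top-k in plain Python. The first three labels are the ones from the tutorial example; the other two labels and all of the raw scores are invented, and this is not the code TensorFlow itself runs.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Turn raw last-layer outputs into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels: List[str], logits: List[float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# One raw score per category; the values are hypothetical.
labels = ["military uniform", "suit", "academic gown", "bow tie", "beer bottle"]
logits = [4.2, 2.1, 1.9, 0.3, -1.0]

probs = softmax(logits)
top3 = top_k(labels, logits, k=3)

for label, p in top3:
    print(f"{label}: {p:.3f}")
print("sum of all probabilities:", round(sum(probs), 6))
```

Nothing here knows where anything is in the image: there is one number per category, and the "top results" are just the largest of those numbers.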
That's my point. These OTS modules only return results for the classes they already know.

The system has to segment before it classifies. That isn't returned to the user, but gradient descent is happening in the background. Like I said, it's a nitpick but important if you're trying to really build novel CV applications.

One of my gripes with people implementing pre-built modules from TF is that you don't really build any of the hard stuff, and it's pre-trained so not much learning is happening. You can't for example build RL systems with off the shelf TF implementations.

Do you understand how convolutional neural networks work? There is no segmentation involved here at all. The input is the raw pixels of the image. The output is the probability that the image belongs to each of the categories the network is capable of predicting.

Also, gradient descent has nothing to do with segmentation at all; I don't understand what you're talking about. Gradient descent is used to find the set of weights that minimizes the error. This is standard when training neural networks of any kind with backpropagation.
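For anyone following along, a toy sketch of gradient descent as a weight-fitting procedure, assuming a one-input linear model and mean squared error rather than a real network. The data, learning rate, and step count are made up for illustration.

```python
from typing import List, Tuple


def mse(w: float, b: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs: List[float], ys: List[float],
                     lr: float = 0.05, steps: int = 2000) -> Tuple[float, float]:
    """Repeatedly step the weights against the gradient of the error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print("error before training:", mse(0.0, 0.0, xs, ys))
print("error after training:", mse(w, b, xs, ys))
print("learned w, b:", round(w, 4), round(b, 4))
```

The loop only ever adjusts weights to reduce the error on the training data. It runs during training, not when a pre-trained model labels a new image, and it never splits the image into regions.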