Hacker News new | ask | show | jobs
by jatsign 3183 days ago
How do CNNs work when the output is multiple categories? For instance, in the same image is a cat and a dog and a car. What's the architecture look like - multiple CNNs, each that can predict one category? Or does one CNN have multiple outputs and if the score > threshold, add that category to the list shown to the user?

Also, how do CNNs draw a box around the target in the image?

3 comments

First question: The network is trained to recognize a fixed set of outputs. That's what makes it a classifier -- it classifies an input into a single output. It does this by giving each output possibility a score, and the highest score is its guess for what the original image is. So if I have a network that I train to recognize cats, dogs, and cars, and I get an output like {cat: .13, dog: .85, car: .02} Then the input was most likely a dog. The network calculates all of those values simultaneously.

You can, of course, tell the network to output whatever you want: all of the guesses, best guess, top five guesses, all guesses over a threshold, etc.

Note, this is a gross oversimplification, but it gets the general concept across.

The first question has been widely answered, so let me jump to the second question -- bounding boxes.

In the past, some people did this inefficiently by just sliding a window across the image and using the same classifier you'd use for the first problem. But this is inefficient, and different sizes make it more inefficient. So the best solution is to use an "Object Detection" network, look into SSD or YOLO to see an example of this.

What's the architecture look like - multiple CNNs, each that can predict one category?

Effectively. That's what a DNN is, ANNs with multiple inference layers each one gives their "highest probability" and then the client/system sets the weight threshold for what is returned.