Hacker News
by terda12 3720 days ago
Hopefully this helps (correct me if I'm wrong, I'm still learning about neural nets):

Think of the whole neural net as a function:

input * weight = output

At each iteration, we feed the input into the neural net. Then the output it produces is compared to the correct output.

For example, say input1 is 5 and the correct output for input1 is 2, but the neural net got 3 as the output. It then decreases the weight slightly so that it would get 2.75 the next time it sees an input of 5. Repeat thousands of times. That's the basic idea behind machine learning and neural networks.
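To make the numbers above concrete, here is a minimal sketch of one update for a single weight. The squared-error cost, the starting weight of 0.6, and the learning rate of 0.005 are my assumptions, picked only so that the output moves from 3 to about 2.75; the names `predict` and `update_weight` are made up for illustration.

```python
# Toy one-weight "network": output = input * weight.
# Assumed for illustration: squared-error cost and learning_rate=0.005.

def predict(x, weight):
    return x * weight

def update_weight(x, target, weight, learning_rate=0.005):
    """Take one gradient descent step on the cost (output - target)**2."""
    error = predict(x, weight) - target   # about 3 - 2 = 1 in the example
    gradient = 2 * error * x              # slope of the cost with respect to the weight
    return weight - learning_rate * gradient

weight = 0.6                              # gives an output of about 3 for input 5
new_weight = update_weight(5, 2, weight)

print(predict(5, weight))       # roughly 3
print(predict(5, new_weight))   # roughly 2.75, a bit closer to the correct output 2
```

A larger learning rate would move the output further in one step, at the risk of overshooting the correct value.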

The algorithm it uses to figure out how much to decrease the weights is called "backpropagation", which computes the gradients that gradient descent then follows. To explain gradient descent, think of a roller coaster track. Imagine the roller coaster starts off at a random location on the track. Then gravity takes the roller coaster down the track until it ends up at a low point between two hills and stays there. This is the new location of the roller coaster. This new location is nice because it has the lowest energy the roller coaster could find, so it stays there. (We use derivatives to figure out the slope of a curve, which then gives us the direction in which the curve goes downhill.)

In neural networks, the roller coaster curve is the "cost function", which basically calculates the amount of difference between the neural net's output and the actual correct output it should have got. The initial weight is the roller coaster's initial position. The new weight is the roller coaster's final position, at the bottom of the cost function curve. This new position thus gives us the lowest cost.
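Here is a rough sketch of the roller coaster rolling to the bottom of the cost curve, using the same toy numbers (input 5, correct output 2). The squared-error cost, the learning rate, the step count, and the function names are all assumptions for illustration, not how any particular library does it.

```python
# The "track" is the cost as a function of the single weight.
# Assumed for illustration: squared-error cost, input 5, correct output 2.

def cost(weight, x=5.0, target=2.0):
    return (x * weight - target) ** 2

def cost_slope(weight, x=5.0, target=2.0):
    # Derivative of the cost with respect to the weight.
    return 2 * (x * weight - target) * x

def roll_downhill(weight, learning_rate=0.005, steps=200):
    """Repeatedly step against the slope, like gravity pulling the coaster down."""
    for _ in range(steps):
        weight -= learning_rate * cost_slope(weight)
    return weight

final_weight = roll_downhill(0.6)          # 0.6 is the initial position
print(final_weight, cost(final_weight))    # weight close to 0.4, cost close to 0
```

This curve has only one valley, so every starting weight ends up in the same place.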

Note that there may be even lower valleys, but when we roll the roller coaster it stops at the nearest valley. This is why we randomize the weights at the beginning - to put the roller coaster near possibly even lower valleys.
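A small sketch of the "nearest valley" problem, assuming a made-up bumpy cost curve with two valleys (this curve is not from any real network, it is just easy to reason about). Where the coaster ends up depends on where it starts, so trying several random starts and keeping the best one gives a better chance of finding the lower valley.

```python
import random

# Hypothetical bumpy cost curve with a shallow valley near 1.13
# and a deeper valley near -1.30.
def bumpy_cost(w):
    return w**4 - 3 * w**2 + w

def bumpy_slope(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting position w."""
    for _ in range(steps):
        w -= learning_rate * bumpy_slope(w)
    return w

def best_of_starts(starts):
    """Run descent from each start and keep the result with the lowest cost."""
    return min((descend(s) for s in starts), key=bumpy_cost)

print(descend(1.5))    # settles in the shallow valley on the right
print(descend(-1.5))   # settles in the deeper valley on the left

random_starts = [random.uniform(-2, 2) for _ in range(10)]
print(best_of_starts(random_starts))
```

With ten random starts it is very likely, though not guaranteed, that at least one of them lands on the side of the deeper valley.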

2 comments

Okay, so it works by minimizing (equiv. maximizing) some function. But that doesn't say much about how it "learns" the gradient. What function does it care about? Average squared error (predict_prob-Z_i)^2? Average absolute error? The likelihood function of some assumed distribution? Maximum distance between the classification border and the closest observed points? Suppose I saw someone carrying a bag full of blueberries and some bread home from the grocery store and asked how they chose what to buy, and they replied: "I had a list of characteristics which I thought were important for groceries to have on this trip to the store. For each grocery item, I recorded a vector of the degrees to which the item possesses each of those characteristics. Finally, I chose the group of groceries that had the best combination of degree vectors." I still wouldn't really know anything about why they bought the blueberries and bread.
The function it minimizes is called the "loss function", and its values for the training and test sets are shown in the upper right area. AFAICT the site doesn't say how it's computed, but I think it's average squared error. The gradient is not learned; if you think of the loss function as a real-valued function of the weights, the gradient is just the vector of its partial derivatives with respect to the weights.
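To illustrate "the gradient is just the partial derivatives", here is a minimal sketch that treats average squared error as a function of the weights and estimates each partial derivative by nudging one weight at a time. The linear model, the tiny data set, and the function names are assumptions for illustration; the site's actual loss may differ.

```python
# Average squared error of a linear model, viewed as a function of the weights.
def mse_loss(weights, inputs, targets):
    total = 0.0
    for x, t in zip(inputs, targets):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - t) ** 2
    return total / len(inputs)

def numerical_gradient(loss, weights, eps=1e-6):
    """Estimate each partial derivative with a central difference."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

# Made-up data: two examples with two input features each.
inputs = [(1.0, 2.0), (3.0, 1.0)]
targets = [1.0, 2.0]
weights = [0.5, 0.5]

loss_of_weights = lambda w: mse_loss(w, inputs, targets)
print(loss_of_weights(weights))                      # 0.125
print(numerical_gradient(loss_of_weights, weights))  # roughly [0.5, 1.0]
```

Nothing here is learned: the gradient falls out of the loss function and the current weights. Backpropagation gets the same numbers analytically instead of by nudging.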
It depends on what you are trying to achieve. There are many loss functions; for example, see: http://cs231n.github.io/neural-networks-2/#losses
If I hadn't taken this course[1], I wouldn't understand what you are talking about.

[1] - https://www.coursera.org/learn/machine-learning