Hacker News
by kaffeinecoma 3704 days ago
People are knocking this guy for not being an expert and maybe getting some details wrong. Maybe it's a little bit like watching a non-programmer stumble their way through a blog post about learning to program: experienced programmers may cringe a bit.

But I really appreciate these kinds of write-ups: he declares his non-expertise up-front, and then proceeds to document his understanding as he goes along. There's something useful about this kind of blog post for non-experts.

I'm working my way through Karpathy's writeup on RNNs (http://karpathy.github.io/2015/05/21/rnn-effectiveness). I've mechanically translated his Python to Go, and even managed to make it work. But I still don't entirely understand the math behind it. Now obviously Karpathy IS an expert, but despite his extremely well-written blog post, a lot of it is still somewhat impenetrable to me ("gradient descent"? I took Linear Algebra oh, about 25 years ago). So sometimes it's nice to see other people who are a bit bewildered by things like tanh(), yet still press on and try to understand the overall process.

And FWIW I had the same reaction as the author when I started toying around with neural nets- it's shocking how small the hidden layer can be and still do useful stuff. It seems like magic, and sometimes you have to run through it step-by-step to understand it.

5 comments

Sorry about that! There's a lot to cover for one blog post to do satisfyingly. I encourage you to check out CS231n for a more thorough treatment. For example, we discuss the tradeoffs of different activation functions like tanh(), we have a gentler introduction to gradient descent, I devote a whole lecture to char-rnn, and assignment #1 (the assignments are available) would demystify the backward pass.

Also definitely +1 for not putting down people who write similar posts. I encourage everyone who is trying to learn to do it through blog posts because it lets you explain/organize thoughts. I also enjoy reading them quite a bit because it illustrates the kinds of conceptual problems beginners face (which is not at all obvious once you've been in the area for a few years). And it's also interesting to see many different interpretations of the same concepts, as everyone has different background and the way they reason through things is usually quite unique. Granted, this one could have been named something more appropriate!

No need to apologize- I learned SO much from your blog, thank you. I didn't realize the course was online (https://www.youtube.com/watch?v=NfnWJUyUJYU). Also, looks like there's a subreddit for it as well: https://www.reddit.com/r/cs231n

It's really wonderful that all of this is freely available, thank you.

The lecture that covers gradient descent in the Youtube list you linked there is the first time gradient descent actually clicked for me, and I made it through the entire Andrew Ng Coursera ML course. Highly highly recommend it.
The video became private. Does anyone know the title of the video, or is there another copy of it somewhere else?
> People are knocking this guy for not being an expert and maybe getting some details wrong.

I think this style of teaching has great value. Someone who's learning something themselves is the person most suitable to teach it to others, since they know exactly what a novice user doesn't know. For example, I wanted to write something up for monads the other day, since it's a simple concept that's made super confusing by people who dive into mathematical notation right away. The downside with this approach is that the novice lacks experience, so what they're learning may not be entirely accurate.

I think the best approach is a hybrid: Someone who is learning the material explains it, and someone who already knows it points out mistakes. In this case, HN can serve as the expert, and we all end up with a very informative post.

One of my great regrets about leaving my last job was this: I had little FP experience, and we had one guy with a lot of it. We had done informal teaching sessions and had planned to write blog posts or record podcasts of them, precisely for the reasons you mentioned, hoping that it would help others absorb the material more easily.
Why can't you still do it? I'd read that.
There's probably no reason I can't still do it, aside from physical distance problems and getting the free time lined up on both sides, which was far easier when we could just schedule it during working hours as 20% time.

If we/I ever do it, I'll make sure to send you a link. :)

Nando De Freitas has a great youtube channel with videos that you might find helpful, including this one in particular on unconstrained optimization: https://www.youtube.com/watch?v=QGOda9mz_yA&list=PLE6Wd9FR--...
FYI, gradient descent is covered in one of the very first weeks of Andrew Ng's Coursera machine learning class, so perhaps just watch those lessons (free)

Gradient descent is basically the approximate solution: getting the exact solution requires computing matrix inverses, which is apparently not yet doable at scale (it's too slow).

I think the reason people do gradient descent is that the datasets are too large to solve for all inputs simultaneously. It isn't impossible in theory, really.
Do you mean to say that it is possible to design your parameters over all inputs without gradient descent? I'm somewhat confused, as I think that that would not be possible in the general case (e.g. nonlinear problems are hard to crack without resorting to an iterative procedure like gradient descent). I can see that gradient descent might still make sense for problems that do have clean analytic solutions (if that's what you meant), as those solutions often turn out to be junk at scale. Linear regression is a good example, as it has a nice closed form expression if the solution exists. But the complexity scales poorly as the naive implementation requires a matrix inversion, so a different method might be employed for a large problem - gradient descent could be a candidate.
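
To make the linear regression example concrete, here is a minimal sketch that fits a line y = w*x + b two ways: with the closed-form least-squares formulas, and with plain gradient descent on the mean squared error. The data, the function names (closed_form, gradient_descent), the learning rate, and the step count are all made up for illustration. With a single feature the closed form needs no matrix inversion, so this only shows that the two routes agree, not the scaling issue.

```python
# Illustrative data lying exactly on y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def closed_form(xs, ys):
    """Least-squares slope and intercept from the one-feature normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Iteratively minimize mean squared error, starting from w = b = 0."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x + b - y)^2) with respect to w and b,
        # computed over the full dataset on every step.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w_exact, b_exact = closed_form(xs, ys)
w_gd, b_gd = gradient_descent(xs, ys)
print(w_exact, b_exact)
print(w_gd, b_gd)
```

Both routes should land on roughly the same slope and intercept here. The iterative version only ever needs gradients, which is why it carries over to nonlinear models where no closed form exists.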

I think gradient descent is attractive because it's a memoryless process at the batch level - you can process training data in batches instead of processing the entire dataset in one go, without any explicit tracking of the previous batch history. This is a great feature when the scale of your dataset is mind-boggling. I think this is what you were suggesting?
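
A rough sketch of what "memoryless at the batch level" could look like in practice: the only state carried from one batch to the next is the current parameters (w, b). The function name minibatch_sgd, the synthetic noise-free data, the batch size, and the learning rate are assumptions chosen for illustration, not anything from the posts discussed above.

```python
import random


def minibatch_sgd(xs, ys, batch_size=10, lr=0.3, epochs=200, seed=0):
    """Fit y = w*x + b by gradient steps computed on small batches of data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # Gradient of the mean squared error on this batch only.
            # Nothing about earlier batches is stored apart from w and b.
            grad_w = (2.0 / m) * sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch)
            grad_b = (2.0 / m) * sum((w * xs[i] + b - ys[i]) for i in batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic, noise-free data on the line y = 2x + 1 (hypothetical values).
xs = [i / 100.0 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(w, b)
```

In a real setting each batch would be streamed from disk or the network rather than indexed out of an in-memory list, but the update loop would have the same shape.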

Strictly speaking, if you split the parameter set into batches and iterate over the batches, optimizing each set of parameters with a gradient step, it is not really gradient descent; it is more a combination of coordinate descent (because you first select the subset of coordinates to optimize) and gradient descent.
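
One possible reading of that distinction, as a sketch: batching over parameters means each step updates only a chosen subset of coordinates (here a single parameter at a time) using a gradient computed on the full dataset, whereas batching over data updates all parameters using a subset of the examples. The function name coordinate_gradient_descent and the toy data are hypothetical.

```python
def coordinate_gradient_descent(xs, ys, lr=0.05, sweeps=2000):
    """Fit y = w*x + b by taking gradient steps on one parameter at a time."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(sweeps):
        # Coordinate 1: update only w, holding b fixed (full-data gradient).
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        w -= lr * grad_w
        # Coordinate 2: update only b, using the freshly updated w.
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        b -= lr * grad_b
    return w, b


# Illustrative data lying exactly on y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = coordinate_gradient_descent(xs, ys)
print(w, b)
```

Note that every step here still touches the whole dataset, which is the opposite trade-off from processing the data in batches.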
Ah yes - that sounds like the stochastic gradient descent I've been hearing about. That makes a lot of sense for very expensive models. Thanks for the response nshm - I've recently taken an interest in ML (coming in with some familiarity with optimization), and it's much appreciated to have some 'REPL' in the learning process.