by rohitarondekar 5023 days ago
Computing (X^TX)^(-1) can be slow if the number of features is large, because inverting a matrix is slow. Also, unless you use the pseudoinverse (pinv in Octave), you need to take care of degenerate cases. However, you can use regularization, i.e. replace (X^TX)^(-1) with (X^TX + lambda*W)^(-1), where lambda is the regularization parameter and W is a matrix of the form:

  |0 0 0|
  |0 1 0|
  |0 0 1|
i.e. the identity matrix with entry (0,0) set to 0

This ensures that the matrix is now invertible (for lambda > 0). Regularization also helps against overfitting.
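A minimal sketch of the regularized normal equation described above, written with only the Python standard library so it runs anywhere (in practice one would more likely call a linear algebra library). The helper names `transpose`, `matmul`, `solve` and `ridge_normal_equation`, and the toy data, are made up for illustration. The two non-intercept features are exact duplicates, so the unregularized X^TX is singular, while lambda = 1 makes the system solvable.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * W)^(-1) X^T y, with W = identity except W[0][0] = 0."""
    Xt = transpose(X)
    A = matmul(Xt, X)
    for j in range(1, len(A)):  # skip index 0: the intercept is not regularized
        A[j][j] += lam
    rhs = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(A, rhs)


# Toy data (assumed): x0 = 1 in every row, features 1 and 2 are identical.
X = [[1.0, 1.0, 1.0],
     [1.0, 2.0, 2.0],
     [1.0, 3.0, 3.0],
     [1.0, 4.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]

try:
    ridge_normal_equation(X, y, lam=0.0)
except ValueError as err:
    print("lambda = 0:", err)

theta = ridge_normal_equation(X, y, lam=1.0)
print("lambda = 1:", theta)
```

Solving the linear system directly, rather than forming the inverse explicitly, is a design choice of this sketch and not something the course material in this thread prescribes.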

P.S. I'm an ML n00b doing the Machine Learning course on Coursera, so I might be unaware of more practical knowledge about the above. :D

2 comments

All regularization work I'm aware of uses W=I (an identity matrix). Where did you find this zero origin matrix?

Note that your W does not guarantee invertibility, e.g., if your original (0,0) entry is already 0.

This was shown by Professor Andrew in the Coursera ML class that's happening right now.

Given n features x1 to xn, we introduce an x0 feature which is always set to 1. During the regularization lectures the professor said that we don't need to control (or regularize) theta0 (the parameter for x0) because it doesn't make a difference. I believe this is the reason W(0,0) is set to 0.

The lectures are a little light on the maths, i.e. the professor explains only enough maths to explain the techniques, so I'm not aware of more details. I'm planning on watching some Linear Algebra lectures to fill in the gaps. :)

Re: invertibility, according to the professor, if lambda is > 0 then the matrix will be invertible. Again, I'm not 100% sure if this is true or not.
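The professor's claim can be checked with a short argument, sketched here under two assumptions: lambda > 0, and the x0 column of X is all ones (the course convention described above). For any vector v:

```latex
v^\top \left( X^\top X + \lambda W \right) v
  = \lVert X v \rVert^2 + \lambda \sum_{j \ge 1} v_j^2 \;\ge\; 0 .

% If this is zero, then v_j = 0 for all j >= 1, so X v = v_0 \mathbf{1}.
% \lVert X v \rVert^2 = 0 then forces v_0 = 0, hence v = 0.
\text{Therefore } X^\top X + \lambda W \text{ is positive definite, and so invertible.}
```

The argument relies on the x0 column being nonzero. Without it, the unregularized (0,0) entry could leave the matrix singular.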

Ok, that clears it up:

He doesn't need to set W(0,0) to 1 specifically because he sets x0 to 0 (which guarantees a non-zero value in the covariance matrix).

But the standard way to do L2 regularization (also known as "ridge regression") is to add a scaled identity matrix (the entire diagonal set to be nonzero).

You mean set x0 to 1, right?

People who do linear regression at work don't add an x0 feature? During the lecture the prof. only said that adding x0=1 for all m samples is a convention and helps simplify the computation. Unless I missed something during the lecture, that's the only explanation that was given.

Yes, I did, thanks.

> People who do linear regression at work don't add a x0 feature?

Sometimes they do that; sometimes the data already has a subset of features known to sum to 1 (e.g., if you have binary variables that reflect "one of n choices", exactly one of which must be set), and in this case adding x0=1 makes things worse (from a numerical perspective) for many algorithms.
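A small sketch of why this happens, using made-up data (the `colors` and `levels` values are hypothetical). With a one-hot encoded "one of n choices" variable, the indicator columns sum to 1 in every row, so an added x0=1 column is an exact linear combination of them and X loses full column rank:

```python
# Hypothetical categorical feature with three levels, one-hot encoded.
colors = ["red", "green", "blue", "green", "red"]
levels = ["red", "green", "blue"]

one_hot = [[1.0 if c == lvl else 0.0 for lvl in levels] for c in colors]
X = [[1.0] + row for row in one_hot]  # prepend the x0 = 1 column

# x0 minus the sum of the indicator columns is zero in every row,
# so v below is a nonzero vector with X @ v = 0.
v = [1.0, -1.0, -1.0, -1.0]
Xv = [sum(a * b for a, b in zip(row, v)) for row in X]
print(Xv)
```

Every entry of `Xv` is zero, which means X^TX v = 0 for a nonzero v, so the unregularized X^TX is singular for this design matrix.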

Regardless, I've always seen regularization theory stated with lambda*identity matrices.

DO NOT EVER use Penrose pseudoinverse in numerical computation. It is guaranteed to diverge.

Can you please explain why this is? Or maybe point to an explanation somewhere? Thanks.