| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sciboy 5070 days ago
	B = (X^TX)^(-1)X^TY anyone?

6 comments

rohitarondekar 5070 days ago

In computing (X^TX)^(-1) if the number of features is large then it can be slow as computing the inverse of a matrix is slow. Also unless you use pseudo inverse (pinv in octave) you need to take care of degenerate cases. However if you use Regularization i.e replace the (X^TX)^(-1) with (X^TX + lambda*W)^(-1), where lambda is the regularization parameter and W is a matrix of the form:

  |0 0 0|
  |0 1 0|
  |0 0 1|

i.e identity matrix with (0,0) set to 0

This ensures that the matrix is now invertible. Regularization takes care of overfitting.

P.S I'm a ml n00b doing Machine Learning course on Coursera so I might be unaware of more practical knowledge of the above. :D

beagle3 5069 days ago

All regularization work I'm aware of uses W=I (an identity matrix). Where did you find this zero origin matrix?

Note that your W does not guarantee invertability - e.g., if your original (0,0) is already 0.

rohitarondekar 5069 days ago

This was shown by Professor Andrew in the Coursera ML class that's happening right now.

Given n features x1 to xn we introduce x0 feature which is always set to 1. During the Regularization lectures the professor said that we don't need to control (or regularize) the theta0 (the parameter for x0) because it doesn't make a difference. I believe this is the reason W(0,0) is set to 0.

The lectures are a little light on the maths, i.e the professor explains only enough maths to explain the techniques so I'm not aware of more details. I'm planning on watching some Linear Algebra lectures to fill in the gaps. :)

Re: Invertability, according to the professor, if lambda is > 0 then the matrix will be invertable. Again I'm not 100% sure if this is true or not.

beagle3 5069 days ago

Ok, that clears it up:

He doesn't need to set W(0,0) to 1 specifically because he sets x0 to 0 (which guarantees a non-zero value in the covariance matrix).

But the standard way to do L2 regularization (also known as "ridge regression") is to add a scaled identity matrix (the entire diagonal set to be nonzero)

rohitarondekar 5069 days ago

You mean set x0 to 1, right?

People who do linear regression at work don't add a x0 feature? During the lecture the prof. only said that adding a x0=1 for all samples m, is by convention and helps simplify the computation. Unless I missed something during the lecture that's the only explanation that was given.

beagle3 5068 days ago

Yes , I did, thanks.

> People who do linear regression at work don't add a x0 feature?

Sometimes they do that; sometimes the data already has a subset known to have sum 1 (e.g., if you binary variables that reflect "one of n choices" which must be set), and in this case adding x0=1 makes things worse (from a numerical perspective) for many algorithms.

Regardless, I've always seen regulation theory stated with lambda*identity matrices.

east2west 5069 days ago

DO NOT EVER use Penrose pseudoinverse in numerical computation. It is guaranteed to diverge.

rohitarondekar 5069 days ago

Can you please explain why this is? Or maybe point to an explanation somewhere? Thanks.

crntaylor 5069 days ago

Implementation tip - you don't need to invert that matrix! [1]. Whenever you see an equation of the form Ab = c and you want to find b, you should use a function equivalent to lu_solve (in the GNU Scientific Library) or the left-divide operator in Octave/Matlab (b = A\c), which will use the LU factorization algorithm [2] without ever computing A^(-1).

This is normally faster, more numerically stable and more space efficient. Even better, once you've computed the LU factorization of A once (which takes O(n^3) operations) you can then solve Ab = c for many different values of b and c in O(n^2) operations, by caching the factorization of A.

[1] http://www.johndcook.com/blog/2010/01/19/dont-invert-that-ma... [2] http://en.wikipedia.org/wiki/LU_decomposition

rohitarondekar 5069 days ago

According to Wikipedia[1]Matrix Inversion is also O(n^3) and according to the note can be even better by inverting smaller matrices within and multiplying[2].

My linear algebra isn't very good so I'll have to look into LU Factorization but is there vast difference between the computational performance of the two operations? (Assuming you don't need solve Ab=c for different values of b and c)

Also is LU Factorization used often in machine learning instead of inverse?

[1]: http://en.wikipedia.org/wiki/Computational_complexity_of_mat...

[2]: Because of the possibility to blockwise invert a matrix, where an inversion of an n×n matrix requires inversion of two half-sized matrices and 6 mulitplications between two half-sized matrices, and since matrix multiplication has a lower bound of Ω(n2 log n) operations[17], it can be shown that a divide and conquer algorithm that uses blockwise inversion to invert a matrix runs with the same time complexity as the matrix multiplication algorithm that is used internally. source is [1]

beagle3 5069 days ago

The reason you want to avoid matrix inversion is not computational performance - it is precision.

You can generally "multiply by the inverse" without actually computing the inverse, in a way that needs less intermediate floating point precision.

If your matrices have a large spread of eigenvalues, it makes a lot of difference - double precision often doesn't have enough precision for direct inversion in the real world.

(And even if you do want to invert, it is more numerically stable to do that as a solution of Ax=e, for each 'e' a basis element, as long as you compute A^-1 \* e indirectly using a numerically stable method)

rohitarondekar 5069 days ago

Thank you for explaining, I think I get it now. The ML course on Coursera is a little light on the Maths. As in the professor only explains as much is needed to perform the techniques. So I think I'll be watching Strangs Linear Algebra lectures on OCW soon. :)

zissou 5070 days ago

Did somebody say B = (X'X)^(-1)X'Y?

Here is one of my favorite OLS-related abuses of linear algebra: y = XB + e ---> e = y - XB ---> e = y - X(X'X)^(-1)X'Y ---> e = [I - X(X'X)^(-1)X'Y]y = My

Now use the two idempotent matrices to compute the SSR: e'e = (My)'(My) ---> e'e = y'M'My ---> e'e = y'MMy ---> e'e = y'My

carbocation 5070 days ago

Definitely. The jump from univariate regression to multivariate regression is a non-obvious step because the univariate approach does not generalize in a straightforward fashion.

oscilloscope 5070 days ago

http://en.wikipedia.org/wiki/Regression_analysis#General_lin...

niggler 5070 days ago

Put a pseudoinverse in there :)