Yeah, dergachev is pointing to one of the cool direct applications of linear algebra to machine learning.
Suppose you are given the data D = (r_1; r_2; ...; r_N) where each row r_i is an (n+1)-vector r_i = (a_i1, a_i2, ..., a_in, b_i).
Each row consists of some observation data. We want to predict a future b_j given the future \vec{a}_j, given that we have seen {r_i}_{i=1...N} (the data set consists of N data rows r_i where both \vec{a}_i and b_i are known).
One simple model for b_i given \vec{a}_i = (a_i1, a_i2, ..., a_in) is
a linear model with $n$ parameters m_1, m_2, ..., m_n:
y_m(\vec{a}_i) = m_1 a_i1 + m_2 a_i2 + ... + m_n a_in.
If the model is good, then y_m(\vec{a}_i) approximates b_i well.
But startup wisdom dictates that we should measure everything!
Enter the error term:
e_i(\vec{m}) = |y_m(\vec{a}_i) - b_i|²,
the squared absolute-value of the difference between the model's prediction and the actual output---hence the name error term.
Our goal is to make the sum $S$ of all the error terms as small as possible.
S(\vec{m}) = \sum_{i=1}^{N} e_i(\vec{m})
Note that the "total squared error" is a function of the model parameters \vec{m}.
At this point we have reached a level of complexity that becomes difficult to follow.
Linear algebra to the rescue!
We can express the "vector prediction" of the model y_m in "one shot" in terms
of the following matrix equation:
A\vec{m} = \vec{b}
where A is an N by n matrix (contains the a_ij part of the data),
\vec{m} is an n by 1 vector (the model parameters---the unknown), and
\vec{b} is an N by 1 vector (contains the b_i part of the data).
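Concretely, if you stack the N rows r_i into an N by (n+1) array D (a layout I'm assuming here for illustration), then A and \vec{b} are just slices:

    import numpy as np

    # hypothetical toy data: N = 4 observations, n = 2 features, last column is b_i
    D = np.array([[1.0, 2.0, 5.1],
                  [2.0, 0.0, 3.9],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.2]])
    A = D[:, :-1]   # N by n matrix of the a_ij values
    b = D[:, -1]    # length-N vector of the b_i values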
To find \vec{m}, we must solve this matrix equation;
however, A is not a square matrix: A is a tall skinny matrix (N >> n),
so there is no A^{-1}.
Okay, so we don't have an A^{-1} to throw at the equation A\vec{m} = \vec{b} to cancel the A, but what else could we throw at it? Let's throw A^T at it!
A^T A \vec{m} = A^T \vec{b}
Define N = A^T A (an n by n matrix); the equation becomes
N \vec{m} = A^T \vec{b}
Now the thing to observe is that if N is invertible,
then we can find \vec{m} using
\vec{m} = N^{-1} A^T \vec{b}
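In NumPy this is a couple of lines; a sketch that solves the normal equations directly (using np.linalg.solve rather than forming N^{-1} explicitly, which is better numerically):

    import numpy as np

    def least_squares_fit(A, b):
        # solve the normal equations (A^T A) m = A^T b;
        # solve() avoids computing N^{-1} explicitly
        N = A.T @ A          # the n by n matrix N = A^T A
        return np.linalg.solve(N, A.T @ b)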
This solution to the problem is known as the "least squares fit" (i.e., the choice of parameters m_1, m_2, ..., m_n for the model). The name comes from the fact that the vector \vec{m} is the output of the following optimization problem:
\vec{m} = argmin_{\vec{m}} S(\vec{m}) = argmin_{\vec{m}} ||A\vec{m} - \vec{b}||²
Proof: http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...
Technical detail: the matrix N = A^T A is invertible if and only if the columns of A are linearly independent.
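You can check that condition numerically: the columns of A are linearly independent exactly when A has full column rank:

    import numpy as np

    def normal_matrix_invertible(A):
        # A^T A is invertible iff A has full column rank,
        # i.e. rank(A) equals the number of columns n
        return np.linalg.matrix_rank(A) == A.shape[1]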
TL;DR. When you have to fit a "linear regression" model to a data matrix X and labels \vec{y},
the best (in the sense of least squared error) linear model is \vec{m} = (X^T X)^{-1} X^T \vec{y}.
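Putting it together on the same toy numbers as above (np.linalg.lstsq computes the identical fit via the SVD and is the numerically safer choice in practice):

    import numpy as np

    X = np.array([[1.0, 2.0],   # toy data matrix, N = 4, n = 2
                  [2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    y = np.array([5.1, 3.9, 1.0, 3.2])

    m_normal = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
    m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same fit via SVD
    print(m_normal, m_lstsq)  # the two agree when X has full column rank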