Yeah, dergachev is pointing to one of the cool direct applications of linear algebra to machine learning.
Suppose you are given the data D = (r_1; r_2; ...; r_N) where each row r_i is an (n+1)-vector r_i = (a_i1, a_i2, ..., a_in, b_i).
Each row consists of some observation data. We want to predict a future b_j given the future \vec{a}_j, given that we have seen {r_i}_{i=1...N} (the data set consists of N data rows r_i where both \vec{a}_i and b_i are known).
One simple model for b_i given \vec{a}_i = (a_i1, a_i2, ..., a_in) is
a linear model with $n$ parameters m_1, m_2, ..., m_n:
y_m(\vec{a}_i) = m_1 a_i1 + m_2 a_i2 + ... + m_n a_in.
If the model is good, then y_m(\vec{a}_i) approximates b_i well.
But startup wisdom dictates that we should measure everything!
Enter the error term:
e_i(\vec{m}) = |y_m(\vec{a}_i) - b_i|²,
the squared absolute-value of the difference between the model's prediction and the actual output---hence the name error term.
Our goal is to make the sum $S$ of all the error terms as small as possible.
S(\vec{m}) = \sum_{i=1}^{N} e_i(\vec{m})
Note that the "total squared error" is a function of the model parameters \vec{m}.
At this point we have reached a level of complexity that becomes difficult to follow.
Linear algebra to the rescue!
We can express the "vector prediction" of the model y_m in "one shot" in terms
of the following matrix equation:
A\vec{m} = \vec{b}
where A is an N by n matrix (contains the a_ij part of the data),
\vec{m} is an n by 1 vector (the model parameters---the unknown), and
\vec{b} is an N by 1 vector (contains the b_i part of the data).
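Concretely, if you stack the N rows r_i into an N by (n+1) array D (a layout I'm assuming here for illustration), then A and \vec{b} are just slices:

    import numpy as np

    # hypothetical toy data: N = 4 observations, n = 2 features, last column is b_i
    D = np.array([[1.0, 2.0, 5.1],
                  [2.0, 0.0, 3.9],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.2]])
    A = D[:, :-1]   # N by n matrix of the a_ij values
    b = D[:, -1]    # length-N vector of the b_i values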
To find \vec{m}, we must solve this matrix equation;
however, A is not a square matrix: A is a tall skinny matrix (N >> n),
so there is no A^{-1}.
Okay, so we don't have an A^{-1} to throw at the equation A\vec{m} = \vec{b} to cancel the A, but what else could we throw at it? Let's throw A^T at it!
A^T A \vec{m} = A^T \vec{b}
Define N = A^T A (an n by n matrix); the equation becomes
N \vec{m} = A^T \vec{b}
Now the thing to observe is that if N is invertible,
then we can find \vec{m} using
\vec{m} = N^{-1} A^T \vec{b}
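In NumPy this is a couple of lines; a sketch that solves the normal equations directly (using np.linalg.solve rather than forming N^{-1} explicitly, which is better numerically):

    import numpy as np

    def least_squares_fit(A, b):
        # solve the normal equations (A^T A) m = A^T b;
        # solve() avoids computing N^{-1} explicitly
        N = A.T @ A          # the n by n matrix N = A^T A
        return np.linalg.solve(N, A.T @ b)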
This solution to the problem is known as the "least squares fit" (i.e., the choice of parameters m_1, m_2, ..., m_n for the model). The name comes from the fact that the vector \vec{m} is the output of the following optimization problem:
\vec{m} = argmin_{\vec{m}} S(\vec{m}) = argmin_{\vec{m}} ||A\vec{m} - \vec{b}||²
Proof: http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...
Technical detail: the matrix N = A^T A is invertible if and only if the columns of A are linearly independent.
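You can check that condition numerically: the columns of A are linearly independent exactly when A has full column rank:

    import numpy as np

    def normal_matrix_invertible(A):
        # A^T A is invertible iff A has full column rank,
        # i.e. rank(A) equals the number of columns n
        return np.linalg.matrix_rank(A) == A.shape[1]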
TL;DR. When you have to fit a "linear regression" model to a data matrix X and labels \vec{y},
the best (in the sense of least squared error) linear model is \vec{m} = (X^T X)^{-1} X^T \vec{y}.
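Putting it together on the same toy numbers as above (np.linalg.lstsq computes the identical fit via the SVD and is the numerically safer choice in practice):

    import numpy as np

    X = np.array([[1.0, 2.0],   # toy data matrix, N = 4, n = 2
                  [2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    y = np.array([5.1, 3.9, 1.0, 3.2])

    m_normal = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
    m_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same fit via SVD
    print(m_normal, m_lstsq)  # the two agree when X has full column rank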