Principal Component Analysis

October 29, 2019


In this post I cover some theoretical aspects of the well-known PCA algorithm. Beware! Linear algebra and vector calculus abound in the following sections.

Problem Statement

Suppose I have a collection of \(m\) points \(\{x_1,x_2,x_3,...,x_m\}, x_i \;\in \;\mathbb{R}^{n}\) and I want to store them using less memory, sacrificing some precision in the process. One way I could do this is by encoding \(x_i \rightarrow c_i\) where \(c_i \; \in \; \mathbb{R}^l\) for some \(l<n\). Thus my requirements are -

  1. Encoding Function $f:\mathbb{R}^n \rightarrow \mathbb{R}^l$ such that $f(x) = c$
  2. Decoding Function $g: \mathbb{R}^l \rightarrow \mathbb{R}^n$ such that $g(c) \approx x$

Formulation

How do we choose $c$? In other words, how do we choose the encoding function?

One way to get the optimum is to minimise the squared norm $ \vert\vert x-g(c) \vert\vert _{2} ^{2} $ over $c$. That is, $c^* = argmin_c \vert\vert x-g(c)\vert\vert _2 ^2$

The Objective Function:

\(||x-g(c)||_2^2\) can be written as \((x-g(c))^T(x-g(c))\)

and \((x-g(c))^T(x-g(c)) = (x^Tx - x^Tg(c) -g(c)^Tx + g(c)^Tg(c))\)

Now,

\(x^Tg(c) = g(c)^Tx\) [ as the transpose of a scalar = the scalar ]

\[\therefore \;objective = x^Tx - 2x^Tg(c) + g(c)^Tg(c)\]

Now, to minimise the \(objective\) w.r.t. $c$, we can drop the \(x^Tx\) term since it does not depend on \(c\),

\[c^* = argmin_c \; -2x^Tg(c) + g(c)^Tg(c)\]

This can be solved using gradient descent (or simply by setting the gradient to zero).

Calculating the Gradient

First, assuming for simplicity that \(g(c) = Dc\) where \(D \in \mathbb{R}^{n \times l}\), \(x \in \mathbb{R}^{n \times 1}\) and \(c \in \mathbb{R}^{l \times 1}\), we need to calculate \(\nabla _c \left(-2x^TDc +(Dc)^TDc\right) = \nabla _c \left(-2x^TDc + c^T(D^TD)c\right)\)

To make things easier for ourselves, we further constrain the columns of \(D\) to be mutually orthogonal and to have unit norm.

This implies that, \(D^TD = I_l\)

Why is \(D^TD = I_l\)?

Recall that \(D\) has 2 properties -

  1. Columns of \(D\) are mutually orthogonal
  2. Columns of \(D\) have unit norm

Visualising \(D^TD\), where \(d_i\) is the \(i^{th}\) column of \(D\):

\[\begin{pmatrix} d_1 \rightarrow & - & - & - \\ d_2 \rightarrow & - & - & - \\ d_3 \rightarrow & - & - & - \\ \\ \vdots & \vdots & \vdots & \vdots \\ d_l \rightarrow & - & - & - \end{pmatrix}\begin{pmatrix} d_1 & d_2 & \ldots & d_l \\ \downarrow & \downarrow & \downarrow & \downarrow \\ & & & \\ \\ \vdots & \vdots & \vdots & \vdots \\ - & - & - & - \end{pmatrix} = A\]

Looking at the matrices, it's clear that \(A_{1,1} = d_1^Td_1 = 1\) [ because of unit norm ]

\(A_{1,2} = d_1^Td_2 = 0\) [ because of orthogonality ]

Following the pattern, we see that \(A = I_l\)
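As a quick numerical sanity check (a minimal NumPy sketch; the particular \(D\) below is just a random example built for illustration), we can construct a \(D\) with orthonormal columns via a QR decomposition and confirm that \(D^TD = I_l\):

```python
import numpy as np

n, l = 5, 3
rng = np.random.default_rng(0)

# Orthonormalise the columns of a random n x l matrix via QR.
# The resulting D has mutually orthogonal, unit-norm columns.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

print(np.allclose(D.T @ D, np.eye(l)))  # True: D^T D = I_l
```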

Coming back to the problem of computing the gradient, \(\nabla _c \left(-2x^TDc + c^TI_lc\right)\)

Now using the identities,

\[\begin{align} \nabla _x \,Ax &= A \\ \nabla _c \,c^TAc &= c^T(A^T + A) \end{align}\]

We get, \(grad = -2x^TD + 2c^T\)

Gradient descent converges when the gradient is $0$, so we set the gradient to $0$ and solve for $c$.

\[\begin{align} \therefore -2x^TD + 2c^T &= 0\\ \Rightarrow c^T &= x^TD\\ \Rightarrow c &= D^Tx \end{align}\]

Thus for a given $x$, the optimal encoding under this scheme is $c^* = D^Tx$. Or we can say that,

\[f(x) = D^Tx=c \;\]
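Here is a minimal sketch of the encoder/decoder pair under these assumptions (NumPy, with a random orthonormal \(D\); the names `encode` and `decode` are my own). It also checks numerically that \(c = D^Tx\) reconstructs \(x\) at least as well as randomly perturbed codes:

```python
import numpy as np

n, l = 5, 3
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((n, l)))  # orthonormal columns

def encode(x, D):
    # f(x) = D^T x
    return D.T @ x

def decode(c, D):
    # g(c) = D c
    return D @ c

x = rng.standard_normal(n)
c = encode(x, D)
err_opt = np.linalg.norm(x - decode(c, D)) ** 2

# No perturbed code should reconstruct x better than c = D^T x.
err_perturbed = min(
    np.linalg.norm(x - decode(c + 0.1 * rng.standard_normal(l), D)) ** 2
    for _ in range(100)
)
print(err_opt <= err_perturbed)  # True
```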

Choosing a Suitable \(D\)

Now that we have found the optimal encoding for a given point, how do we find the optimal $D$ for all points in the given set?

Taking a similar approach as earlier, we now need to optimise over $D$.

Therefore, we need to find $D^*$ such that

\[\begin{align} D^* &= argmin_D \; \displaystyle\sum _i \vert\vert x^{(i)} - g(f(x^{(i)}))\vert\vert^2 _2\\ &= argmin_D \; \displaystyle\sum_i\vert\vert x^{(i)} - DD^Tx^{(i)}\vert\vert^2 _2 \end{align}\]

Suppose, for ease of understanding, that \(l=1\). This means that \(D \in \mathbb{R}^{n \times 1}\) and \(x^{(i)} \in \mathbb{R}^{n \times 1}\). Renaming \(D := d\), so that \(d \in \mathbb{R}^{n \times 1}\), and noting that \(d^Tx^{(i)}\) is a scalar, we rearrange as

\[d^* = argmin _d \displaystyle\sum_i \vert\vert x^{(i)} - (x^{(i)})^Tdd \vert\vert ^2 _2\]

Instead of summing over all \(m\) points, we put all the \(x^{(i)}\) into a matrix \(X\) such that \(row_i = (x^{(i)})^T\).

How do we get rid of the summation?

Let's look at the original expression, $x^{(i)} - (x^{(i)})^Tdd$. For some point \(x^{(i)}\), this looks like

\[\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} - \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix}\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix}\]

Thus \((x^{(i)})^Td\) is a scalar equal to \(\sum_{k=1}^nx_kd_k = \lambda_i\), and the final vector would be

\[\begin{pmatrix} x_1 - \lambda_id_1 \\ x_2 - \lambda_id_2 \\ \vdots \\ x_n - \lambda_id_n \end{pmatrix}\]

The squared norm of this vector would be \(\displaystyle\sum_k (x_k-\lambda_id_k)^2\)

Let's try replacing \(x^{(i)}\) with $X$. $X$ looks like

\[\begin{pmatrix} x^{(1)} \rightarrow & - & - & - \\ x^{(2)} \rightarrow & - & - & - \\ \vdots & \vdots & \vdots & \vdots \\ x^{(m)} \rightarrow & - & - & - \\ \end{pmatrix}\]

Since we'll finally be getting a matrix, we'll be using the Frobenius norm. Hence, if the norms are to be equivalent, the final matrix should look like

\[\begin{pmatrix} x^{(1)}_1 - \lambda_1d_1 & x^{(1)}_2 - \lambda_1d_2 & \ldots & x^{(1)}_n - \lambda_1d_n \\ x^{(2)}_1 - \lambda_2d_1 & x^{(2)}_2 - \lambda_2d_2 & \ldots & x^{(2)}_n - \lambda_2d_n \\ \vdots & \vdots & \ldots & \vdots \\ x^{(m)}_1 - \lambda_md_1 & x^{(m)}_2 - \lambda_md_2 & \ldots & x^{(m)}_n - \lambda_md_n\end{pmatrix}\]

And this is equal to

\[\begin{pmatrix} x^{(1)} \rightarrow & - & - & - \\ x^{(2)} \rightarrow & - & - & - \\ \vdots & \vdots & \vdots & \vdots \\ x^{(m)} \rightarrow & - & - & - \\ \end{pmatrix} - \begin{pmatrix} \lambda_1d_1 & \lambda_1d_2 & \ldots & \lambda_1d_n \\ \lambda_2d_1 & \lambda_2d_2 & \ldots & \lambda_2d_n \\ \vdots & \vdots & \vdots & \vdots \\ \lambda_md_1 & \lambda_md_2 & \ldots & \lambda_md_n \\ \end{pmatrix}\]

Notice that in the second matrix there is a pattern in the occurrences of the \(d_i\)'s, and it can be decomposed into

\[\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{pmatrix} \begin{pmatrix} d_1 & d_2 & \ldots & d_n \end{pmatrix}\]

The second factor is nothing but \(d^T\). For the first factor, the column of \(\lambda_i\)'s, we have

\[\begin{pmatrix} x^{(1)}_1d_1 + x^{(1)}_2d_2 + \ldots \\ x^{(2)}_1d_1 + x^{(2)}_2d_2 + \ldots \\ \\ \vdots \\ \\ x^{(m)}_1d_1 + x^{(m)}_2d_2 + \ldots \end{pmatrix} = \begin{pmatrix} x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \\ \vdots \\ \\ x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{pmatrix} \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix} = Xd\]

Thus the optimisation reduces to (using the Frobenius norm)

\[d^* = argmin_d \vert\vert X-Xdd^T \vert\vert ^2 _F\]
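To convince ourselves that stacking the points into \(X\) really does replace the summation, here is a quick numerical check (a sketch on random data; \(d\) here is just a random unit vector, not yet the optimal one):

```python
import numpy as np

m, n = 10, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))   # row i is (x^(i))^T
d = rng.standard_normal(n)
d /= np.linalg.norm(d)            # unit-norm d

# Summation form: sum_i || x^(i) - ((x^(i))^T d) d ||_2^2
summed = sum(np.linalg.norm(x - (x @ d) * d) ** 2 for x in X)

# Matrix form: || X - X d d^T ||_F^2
frob = np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

print(np.allclose(summed, frob))  # True
```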

The Frobenius Norm can be written in the form of a Trace Operator.

\[||A||^2 _F = Tr(AA^T)\]
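A one-line numerical check of this identity (a sketch on a random matrix):

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((10, 4))

# ||A||_F^2 = Tr(A A^T)
print(np.allclose(np.linalg.norm(A, "fro") ** 2, np.trace(A @ A.T)))  # True
```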

Thus our objective function becomes \(Tr((X-Xdd^T)(X-Xdd^T)^T)\)

By the property \(Tr(AB) = Tr(BA)\),

We need to minimize w.r.t $d$, $Tr((X-Xdd^T)^T(X-Xdd^T))$

\[\begin{align} &= Tr(X^TX - X^TXdd^T - (Xdd^T)^TX + (Xdd^T)^T(Xdd^T)) \\ &= Tr(X^TX - X^TXdd^T - dd^TX^TX + dd^TX^TXdd^T) \end{align}\]

Since we need to minimise over $d$ and the first term does not contain $d$, we can discard it. So, we need,

\[argmin_d -Tr(X^TXdd^T) - Tr(dd^TX^TX) + Tr(dd^TX^TXdd^T)\]

Using the cyclic property of the $trace$,

\[argmin_d -2Tr(X^TXdd^T) + Tr(dd^TX^TXdd^T) \\ = argmin_d -2Tr(X^TXdd^T) + Tr(d^TX^TXdd^Td)\\ \because d^Td = I_1 = 1 \\ = argmin_d -2Tr(d^TX^TXd) + Tr(d^TX^TXd) \\ \therefore d^* = argmin_d -Tr(d^TX^TXd)\]

This can be solved using eigenvalue decomposition (EVD). When \(d\) is an eigenvector of \(X^TX\) with eigenvalue \(\alpha\), \(X^TXd = \alpha d\), so \(d^* = argmin_d - \alpha Tr(d^Td) = argmin_d -\alpha\) [ since \(d^Td = 1\) ]

Thus to minimise the objective, we need to choose the eigenvector with the greatest eigenvalue \(\alpha\).
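We can check this conclusion numerically (again a sketch on random data): the unit eigenvector of \(X^TX\) with the largest eigenvalue should give a reconstruction error no worse than that of random unit vectors.

```python
import numpy as np

m, n = 50, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))

def recon_error(d):
    # || X - X d d^T ||_F^2 for a unit vector d
    return np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

# Eigenvectors of the symmetric matrix X^T X, eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
d_star = eigvecs[:, -1]  # eigenvector of the largest eigenvalue

random_errors = []
for _ in range(200):
    d = rng.standard_normal(n)
    random_errors.append(recon_error(d / np.linalg.norm(d)))

print(recon_error(d_star) <= min(random_errors))  # True
```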

This derivation was for \(l=1\), which gives us the first principal component. Next, we want to show that for larger \(l\), \(D\) is the matrix of the eigenvectors corresponding to the \(l\) greatest eigenvalues.

https://stats.stackexchange.com/questions/279416/how-to-prove-pca-using-induction

Generalizing from \(l=1\)

The objective that we want to optimise is $argmin_D \vert\vert X-XDD^T \vert\vert _F^2$, which according to the above discussion simplifies to $D^* = argmin_D \; - Tr(D^TX^TXD)$

For generalising to some \(l\), the author of the linked answer suggests induction. I have not been able to come up with an inductive proof, so here is an attempt from my side.

For \(l=1\), the eigenvector corresponding to the maximum eigenvalue is the solution.

Let \(X^TX := A\); then \(A \in \mathbb{R}^{n \times n}\), as \(X \in \mathbb{R}^{m \times n}\)

Suppose \(D\) is the matrix of unit-norm eigenvectors \(v^{(i)}\) of \(A\) corresponding to the \(l\) greatest eigenvalues.

\(D = \begin{pmatrix} v^{(1)} & v^{(2)} & \ldots & v^{(l)} \\ \downarrow & \downarrow & \downarrow & \downarrow \end{pmatrix}_{n \times l}\) such that \(v^{(1)} = \begin{pmatrix} v^{(1)}_1 & v^{(1)}_2 & \ldots & v^{(1)}_n \end{pmatrix}^T\) and \(\vert\vert v^{(i)} \vert\vert _2 =1\)

A few things to note about \(A\):

  1. \(Av^{(i)} = \lambda_i v^{(i)}\) for each eigenvector \(v^{(i)}\) with eigenvalue \(\lambda_i\)
  2. $A$ is symmetric, therefore the eigenvectors of \(A\) are orthogonal to each other, that is, $v^{(i)} . v^{(j)} = 0 \;, i \neq j$
  3. $A$ is positive semidefinite, i.e. $x^TAx \geq 0 \; \forall \; x$, since $y^T(X^TX)y = (y^TX^T)Xy = (Xy)^TXy = \vert\vert Xy \vert\vert ^2 \geq 0 \; \forall \; y$

Then,

\[\begin{align} AD &= \begin{pmatrix} \lambda_1v^{(1)} & \lambda_2v^{(2)} & \ldots & \lambda_l v^{(l)} \\ \downarrow & \downarrow & & \downarrow \end{pmatrix}_{n \times l}\\ \therefore D^TAD &= \begin{pmatrix} \lambda_1 v^{(1)}. v^{(1)} & \lambda_2 v^{(1)}. v^{(2)} & \ldots & \lambda_l v^{(1)}. v^{(l)} \\ \\ \vdots & \vdots & & \vdots \\ \\ \lambda_1 v^{(l)} . v^{(1)} & \lambda_2 v^{(l)}.v^{(2)} & \ldots & \lambda_l v^{(l)}.v^{(l)} \end{pmatrix} \\ \\ &= \begin{pmatrix} \lambda_1 ||v^{(1)}||^2 & 0 & \ldots & 0 \\ 0 & \lambda_2||v^{(2)}||^2 & \ldots & 0 \\ \\ \vdots & \vdots & \vdots & \vdots\\ \\ 0 & 0 & \ldots & \lambda_l||v^{(l)}||^2 \end{pmatrix}_{l \times l} \end{align}\]

Therefore, \(Tr(D^TAD) = \displaystyle\sum_i \lambda_i \vert\vert v^{(i)} \vert\vert ^2 = \displaystyle\sum_i \lambda_i\)

Since \(A\) is positive semidefinite, \(\lambda_i \geq0 \; \forall \; i\)

https://math.stackexchange.com/questions/518890/prove-that-every-positive-semidefinite-matrix-has-nonnegative-eigenvalues

Therefore \(-Tr(D^TAD)\) is minimised, i.e. \(\displaystyle\sum_i \lambda_i\) is maximised, when we take the eigenvectors corresponding to the greatest \(l\) eigenvalues of \(X^TX\)
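A small check of this last step (a sketch on random data): build \(D\) from the eigenvectors of \(A = X^TX\) corresponding to the \(l\) largest eigenvalues and confirm that \(Tr(D^TAD)\) equals the sum of those eigenvalues.

```python
import numpy as np

m, n, l = 50, 6, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
A = X.T @ X                           # n x n, symmetric positive semidefinite

eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
D = eigvecs[:, -l:]                   # eigenvectors of the l largest eigenvalues

print(np.allclose(np.trace(D.T @ A @ D), eigvals[-l:].sum()))  # True
```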

Code

Now to code it. What do we require?

  1. For \(f(X)=D^TX\), \(D\) is the matrix of eigenvectors of \(X^TX\). Therefore, we need a function to find eigenvectors efficiently, and a function to multiply matrices efficiently. A minimal sketch putting these pieces together follows below.
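Here is that minimal PCA sketch in NumPy, following the derivation above. Note that, like the derivation, it does not centre the data (standard PCA usually subtracts the mean first), and the names `pca_fit`, `encode` and `decode` are my own:

```python
import numpy as np

def pca_fit(X, l):
    """Return D whose columns are the eigenvectors of X^T X
    corresponding to the l largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # ascending eigenvalues
    return eigvecs[:, -l:]

def encode(X, D):
    # f(x) = D^T x, applied to every row of X
    return X @ D

def decode(C, D):
    # g(c) = D c, applied to every row of C
    return C @ D.T

# Usage: compress 100 points in R^10 down to R^3 and reconstruct.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
D = pca_fit(X, l=3)
X_hat = decode(encode(X, D), D)
print(np.linalg.norm(X - X_hat, "fro") ** 2)    # reconstruction error
```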

Application

How does image compression work?

Image compression is of two types -

Lossless Compression - I'll be looking at this type, as the original image can be retrieved exactly as it was before compression. Formats which implement lossless compression are -

  1. PNG - Uses DEFLATE compression algorithm to compress images.
  2. TIFF - Uses the run length encoding algorithm to compress data.

Lossy Compression - During lossy compression, information is lost and hence the original image cannot be retrieved exactly. Lossy compression uses something called transform coding to compress the data. This leads to the introduction of some artifacts, as seen in JPEG images.

For our case, let's look at run-length encoding.

Example - WWWWWWBBBBBBBWWWWBBBB. In run-length encoding this is stored as 6W7B4W4B.
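A tiny run-length encoder reproducing the example above (a sketch; `rle_encode` is just an illustrative name):

```python
from itertools import groupby

def rle_encode(s: str) -> str:
    # Collapse each run of identical characters into "<count><char>".
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(s))

print(rle_encode("WWWWWWBBBBBBBWWWWBBBB"))  # prints 6W7B4W4B
```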