Mar 4, 2021

PCA or SVD? Are they really different?

I was educated as an applied mathematician, where SVD was in our DNA! So, quite naturally, when I first heard people, mostly in data science, talking about PCA, I got hurt: "Oh! It is SVD, do not call it PCA!" Strictly speaking, I was right; however, the whole story is more interesting than just a name. Here, I talk about SVD, PCA, their origins, and more!

What is SVD (Singular Value Decomposition)?

SVD is a matrix factorization in which a rectangular matrix \(X\in\mathbb{R}^{m\times n}\) is decomposed as the product of two square orthogonal matrices \(U\in\mathbb{R}^{m\times m}\) and \(V\in\mathbb{R}^{n\times n}\) and the diagonal matrix \(\Sigma\in\mathbb{R}^{m\times n}\): \[\boxed{X = U\,\Sigma\,V^t}\] The singular values (the diagonal entries of \(\Sigma\)) are really important: the magnitude of each one indicates the importance of the corresponding columns of \(U\) and \(V\). Notice that the matrix \(X\) can be seen as a linear transformation from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). So, in the matrix-vector multiplication \(X\xi = U\,\Sigma\,V^t\xi\), the role of each factor can be interpreted as below:
  • Columns of \(V\) give an orthonormal basis for \(\mathbb{R}^n\). So \(V^t\xi\) is a change of basis from the standard basis to the basis given by \(V\).
  • \(\Sigma\) has non-zero entries only on its diagonal, whose role is to scale each component of \(V^t\xi\), i.e., of \(\xi\) already expressed in the new basis. It also adapts the dimension: when \(m\leq n\) the last \(n-m\) components of \(V^t\xi\) are dropped, and when \(m\geq n\) zeros are appended.
  • \(U\) plays a similar role to \(V\), but in \(\mathbb{R}^m\): its columns give an orthonormal basis for \(\mathbb{R}^m\), and it expresses the scaled vector back in the standard basis of \(\mathbb{R}^m\).
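To make this concrete, here is a minimal NumPy sketch (the array sizes and variable names are my own choices, not from the text) that computes a full SVD and checks the three-step interpretation of \(X\xi\):

```python
import numpy as np

# A small rectangular matrix X (m = 4 rows, n = 3 columns).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Full SVD: U is 4x4, Vt is 3x3, s holds the singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Sigma is the 4x3 "diagonal" matrix that scales and adapts the dimension.
Sigma = np.zeros(X.shape)
np.fill_diagonal(Sigma, s)

# The factorization reconstructs X (up to floating-point error).
assert np.allclose(X, U @ Sigma @ Vt)

# Applying X to a vector xi is exactly the three steps described above.
xi = rng.standard_normal(3)
step1 = Vt @ xi          # change of basis in R^n
step2 = Sigma @ step1    # scale each component, adapt the dimension to R^m
step3 = U @ step2        # express the result in the standard basis of R^m
assert np.allclose(X @ xi, step3)
```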
If you would like a more detailed explanation, you can consult this great course by Gilbert Strang.
SVD was derived by Beltrami (1873) and Jordan (1874) for bilinear forms.

What is PCA (Principal Component Analysis)?

It was Karl Pearson in 1901 who introduced PCA, in a paper titled "On Lines and Planes of Closest Fit to Systems of Points in Space." In fact, he was trying to answer a very interesting question: how can we represent "a system of points" in a high-dimensional space \(\mathbb{R}^d\) by the "best-fitting" straight line or plane? He defined the "best" line as the line whose mean squared distance to all the points is the smallest. Using the notion of correlation, he changed the problem to finding the largest axis of an ellipsoid (the "ellipsoid of residuals").
So, he found one direction that best explains the data, and another direction (perpendicular to the first one) that is the worst at explaining the data. Note that these directions (or lines) are not necessarily along the Cartesian axes.
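As a small numerical illustration (the synthetic data and the helper function below are made up for this sketch), Pearson's best-fitting direction is the leading eigenvector of the covariance matrix of the centered points, and it gives a smaller mean squared perpendicular distance than the Cartesian axes or a random direction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D "system of points", stretched along a rotated (non-Cartesian) direction.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
points = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.5]) @ R.T
centered = points - points.mean(axis=0)

# Leading eigenvector of the covariance matrix = the best-fitting direction.
cov = centered.T @ centered / len(centered)
eigvals, eigvecs = np.linalg.eigh(cov)
best_dir = eigvecs[:, -1]            # eigenvector with the largest eigenvalue

def mean_sq_perp_dist(direction):
    """Mean squared distance of the centered points to the line through
    the centroid along `direction`."""
    direction = direction / np.linalg.norm(direction)
    proj = centered @ direction      # signed lengths of the projections
    return np.mean(np.sum(centered**2, axis=1) - proj**2)

print(mean_sq_perp_dist(best_dir))              # smallest of the three
print(mean_sq_perp_dist(np.array([1.0, 0.0])))  # a Cartesian axis
print(mean_sq_perp_dist(rng.standard_normal(2)))# a random direction
```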

This method is often used to find the "best" low-dimensional approximation of high-dimensional data: to ignore the least important directions and keep only the ones that really matter.
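Here is a hedged sketch of that use (the sizes, noise level, and number of kept components are arbitrary choices for the example): we build 10-dimensional data that really lives in 2 dimensions, run PCA via the SVD of the centered data matrix, and reconstruct the data from only the top components.

```python
import numpy as np

rng = np.random.default_rng(2)

# 200 samples in 10 dimensions, but most of the variance lives in 2 directions.
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 10))
data = latent @ mixing + 0.05 * rng.standard_normal((200, 10))

# PCA via the SVD of the centered data matrix.
mean = data.mean(axis=0)
U, s, Vt = np.linalg.svd(data - mean, full_matrices=False)

k = 2                                   # number of principal components to keep
scores = (data - mean) @ Vt[:k].T       # low-dimensional representation (200 x 2)
approx = scores @ Vt[:k] + mean         # back to 10-D: the best rank-k fit

rel_error = np.linalg.norm(data - approx) / np.linalg.norm(data)
print(f"relative reconstruction error with k={k}: {rel_error:.4f}")
```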

So, how do they relate to each other?