I was educated as an applied mathematician, where the SVD was in our DNA! So, quite naturally, when I first heard people, mostly in data science, talking about PCA, I got hurt! "Oh! It is SVD, do not call it PCA!" As a matter of fact, strictly speaking, I was right; however, the whole story is more interesting than just a name. Here, I talk about SVD, PCA, their origins, and more!
What is SVD (Singular Value Decomposition)?
Every \(m\times n\) matrix \(A\) can be factored as \(A = U\Sigma V^t\), where \(U\) and \(V\) are orthogonal and \(\Sigma\) is diagonal. Acting on a vector \(\xi\in\mathbb{R}^n\), the three factors do the following:
- Columns of \(V\) give an orthonormal basis for \(\mathbb{R}^n\). So \(V^t\xi\) is a change of basis from the standard basis to the basis given by \(V\).
- \(\Sigma\) has non-zero entries only on its diagonal; its role is to scale each component of \(V^t\xi\), the vector already expressed in the new basis. It also adapts the dimension, dropping the extra components (when \(m\leq n\)) or appending zeros (when \(m\geq n\)) so that the result lives in \(\mathbb{R}^m\).
- \(U\) performs a similar change of basis as \(V\), but in \(\mathbb{R}^m\) (see the sketch below).
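To make these three steps concrete, here is a minimal NumPy sketch; the matrix \(A\) and the vector \(\xi\) are hypothetical, chosen only for illustration. It checks that applying \(A\) to \(\xi\) gives the same result as the chain of the three steps above, and that \(A = U\Sigma V^t\).

```python
import numpy as np

# A hypothetical 3x2 matrix (m = 3, n = 2) and vector, chosen only for illustration.
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])
xi = np.array([1.0, -2.0])

# Full SVD: U is 3x3, s holds the singular values, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Applying A to xi is exactly the three steps described above.
step1 = Vt @ xi        # change of basis in R^n
step2 = Sigma @ step1  # scale each component and adapt the dimension (n -> m)
step3 = U @ step2      # change of basis in R^m

print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U Sigma V^t
print(np.allclose(A @ xi, step3))      # True: A xi = U Sigma V^t xi
```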
What is PCA (Principal Component Analysis)?
It was Karl Pearson who introduced PCA in 1901, in a paper titled "On Lines and Planes of Closest Fit to Systems of Points in Space." In fact, he was trying to answer a very interesting question: how can one represent "a system of points" in a high-dimensional space \(\mathbb{R}^d\) by the "best-fitting" straight line or plane? He defined the "best" line as the line whose mean squared distance to all the points is the smallest. Using the language of correlation, he turned this into the problem of finding the largest axis of an ellipsoid (the "ellipsoid of residuals"): that axis is the one direction which best explains the data, while the direction perpendicular to it is the worst direction for explaining the data. Note that these directions (or lines) are not necessarily along the Cartesian axes. This method is often used to find the "best" low-dimensional approximation of high-dimensional data: to ignore the least important part and keep only the part that really matters.
So, how do they relate to each other?
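As a quick numerical illustration of the connection (and of Pearson's best-fitting line), here is a minimal NumPy sketch; the 2-D point cloud is hypothetical and used only for demonstration. It computes the best-fitting direction twice, once from the eigenvectors of the covariance matrix (PCA as Pearson framed it) and once from the SVD of the centered data matrix, and checks that the two agree up to sign.

```python
import numpy as np

# Hypothetical 2-D point cloud, used only to illustrate Pearson's idea.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                          [1.0, 0.5]])

# Center the data: PCA studies directions of variation around the mean.
Xc = X - X.mean(axis=0)

# Route 1 (Pearson / PCA): eigenvectors of the covariance matrix.
cov = (Xc.T @ Xc) / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
best_fit_direction = eigvecs[:, -1]     # direction of largest variance

# Route 2 (SVD): top right singular vector of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
top_right_singular_vector = Vt[0]

# Up to sign, the two directions coincide; Pearson's best-fitting line passes
# through the mean of the points along this direction.
print(np.allclose(np.abs(best_fit_direction), np.abs(top_right_singular_vector)))
```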