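$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
If the dataset is already centered, $\bar{x} = 0$. If $\bar{x} \neq 0$, do the following:
$$x'_i = x_i - \bar{x}$$
$$X_c = \begin{bmatrix} | & & | \\ x'_1 & \cdots & x'_n \\ | & & | \end{bmatrix}$$
$X_c$ is the centered data-matrix.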
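The centering step can be sketched in NumPy as follows. This is a minimal illustration, assuming the data-points are stored as the columns of a d × n array; the toy values and the names `X_raw`, `x_bar` and `X_c` are made up for the example.

```python
import numpy as np

# Hypothetical toy dataset: d = 2 features, n = 4 points stored as columns.
X_raw = np.array([[1.0, 2.0, 3.0, 6.0],
                  [2.0, 0.0, 4.0, 2.0]])

# Mean of the data-points: average over the columns, kept as a (d, 1) column.
x_bar = X_raw.mean(axis=1, keepdims=True)

# Subtract the mean from every column to obtain the centered data-matrix.
X_c = X_raw - x_bar

# After centering, the mean of the columns is the zero vector.
print(x_bar.ravel())
print(X_c.mean(axis=1))
```

Keeping the mean as a (d, 1) column lets broadcasting subtract it from each data-point without an explicit loop.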
Remark: From now on, we will work only with the centered data-matrix and call it X (the subscript c is dropped).
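2. Covariance matrix
Example:
$$C = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}$$
Shape:
$$C \in \mathbb{R}^{d \times d}$$
Outer-product form:
$$C = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T$$
Matrix form:
$$C = \frac{1}{n} X X^T$$
Scalar form:
$$C_{pq} = \frac{1}{n}\sum_{i=1}^{n} x_{ip} x_{iq}$$
$C_{pq}$ captures the covariance between the $p$th feature and the $q$th feature. As a special case:
$$C_{pp} = \frac{1}{n}\sum_{i=1}^{n} x_{ip}^2$$
$C_{pp}$ captures the variance of the $p$th feature.
Properties
• $C^T = C$
• All eigenvalues of $C$ are non-negative.
  – $\lambda_1 \geqslant \cdots \geqslant \lambda_d \geqslant 0$
• There is an orthonormal basis for $\mathbb{R}^d$ made up of eigenvectors of $C$.
  – $\{w_1, \cdots, w_d\}$
  – This comes from the spectral theorem.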
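As a quick numerical check, the three forms of the covariance matrix can be compared on a small example. The sketch below assumes a centered d × n data-matrix with data-points as columns; the toy values and variable names are illustrative only.

```python
import numpy as np

# Hypothetical centered data: d = 2 features, n = 4 points stored as columns.
X = np.array([[-2.0, -1.0, 0.0, 3.0],
              [0.0, -2.0, 2.0, 0.0]])
d, n = X.shape

# Matrix form: C = (1/n) X X^T.
C = (X @ X.T) / n

# Outer-product form: average of x_i x_i^T over the data-points.
C_outer = sum(np.outer(X[:, i], X[:, i]) for i in range(n)) / n

# Scalar form for one entry: C_pq = (1/n) sum_i x_ip x_iq.
p, q = 0, 1
C_pq = np.sum(X[p, :] * X[q, :]) / n

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(C)

print(C)
print(np.allclose(C, C_outer), np.isclose(C[p, q], C_pq))
print(np.allclose(C, C.T), np.all(eigenvalues >= -1e-12))
```

The small negative tolerance in the last check only guards against floating-point round-off; in exact arithmetic the eigenvalues are non-negative.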
Note: If C is a square matrix, then (𝜆,w) is said to be an eigenvalue-eigenvector pair if Cw=𝜆w. Note that w≠0 for it to be an eigenvector.
Remark: wi will always represent a unit-norm vector in the rest of the document.
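3. Optimization problem
Minimizing the reconstruction error:
$$\min_{w} \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - (x_i^T w) w \rVert^2$$
Maximizing the variance:
$$\max_{w} \; w^T C w$$
In both problems $w$ is constrained to be a unit vector. Both forms are equivalent to each other: for a unit vector $w$, $\lVert x_i - (x_i^T w) w \rVert^2 = \lVert x_i \rVert^2 - (x_i^T w)^2$, so minimizing the average error is the same as maximizing $\frac{1}{n}\sum_{i=1}^{n} (x_i^T w)^2 = w^T C w$.
4. Principal components
Let $(\lambda_1, w_1), \cdots, (\lambda_d, w_d)$ be the eigen-pairs of $C$, where $\lambda_1 \geqslant \cdots \geqslant \lambda_d$ and $\{w_1, \cdots, w_d\}$ is an orthonormal basis for $\mathbb{R}^d$. $w_i$ is termed the $i$th principal component of $C$. To be more precise:
$$C w_i = \lambda_i w_i$$
$$w_i^T w_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$$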
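The eigen-pair and orthonormality conditions above can be verified numerically. The sketch below uses the 2 × 2 matrix with entries 1 and −0.9 that appears in Section 2; the use of `numpy.linalg.eigh` and the names `lambdas` and `W_all` are choices made for this illustration, not something the notes prescribe.

```python
import numpy as np

# The 2 x 2 covariance matrix with entries 1 and -0.9 from Section 2.
C = np.array([[1.0, -0.9],
              [-0.9, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# with unit-norm eigenvectors as the columns of the second output.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so that lambda_1 >= ... >= lambda_d.
order = np.argsort(eigenvalues)[::-1]
lambdas = eigenvalues[order]
W_all = eigenvectors[:, order]  # column index j-1 holds w_j

# Check C w_i = lambda_i w_i for every pair (column j is scaled by lambda_j).
print(np.allclose(C @ W_all, W_all * lambdas))

# Check orthonormality: W^T W is the identity matrix.
print(np.allclose(W_all.T @ W_all, np.eye(2)))
```

Eigenvectors are only determined up to sign, so a different library or platform may return −w_i in place of w_i; both satisfy the same conditions.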
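$$\lambda_1 = \max_{w} \; w^T C w$$
$$w_1 = \operatorname*{argmax}_{w} \; w^T C w$$
$$w_1^T C w_1 = \lambda_1$$
5. Projections
[Figure: a vector $x$, its projection $(x^T w) w$ onto the direction $w$, and the residual $x - (x^T w) w$]
(Vector) projection of $x_i$ onto the $j$th PC:
$$(x_i^T w_j) w_j$$
Scalar projection of $x_i$ onto the $j$th PC (or) coordinate of the data-point along this direction:
$$x_i^T w_j$$
The projection of a data-point $x_i$ onto the top $k$ principal components:
$$x'_i = (x_i^T w_1) w_1 + \cdots + (x_i^T w_k) w_k$$
To represent the reconstruction and scalar projections in matrix form:
$$W \in \mathbb{R}^{d \times k}, \qquad W = \begin{bmatrix} | & & | \\ w_1 & \cdots & w_k \\ | & & | \end{bmatrix}$$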
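To make the projection of a single data-point concrete, the sketch below builds $W$ from the top $k$ eigenvectors and compares the term-by-term sum of vector projections with the matrix expression $W W^T x$. The random data, the seed and the value k = 2 are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical centered data: d = 3 features, n = 5 points stored as columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
X = X - X.mean(axis=1, keepdims=True)
n = X.shape[1]

# Covariance matrix and its eigenvectors, sorted by decreasing eigenvalue.
C = (X @ X.T) / n
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]

k = 2
W = eigenvectors[:, order[:k]]  # d x k, columns are w_1, ..., w_k

x = X[:, 0]  # one data-point

# Scalar projections (coordinates) of x along each of the top k PCs.
coords = W.T @ x

# Sum of the vector projections (x^T w_j) w_j, written out term by term.
x_proj = sum((x @ W[:, j]) * W[:, j] for j in range(k))

# The same projection in matrix form: W W^T x.
print(np.allclose(x_proj, W @ coords))

# The residual x - x_proj is orthogonal to every retained direction.
print(np.allclose(W.T @ (x - x_proj), 0.0))
```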
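Scalar projections:
$$X' \in \mathbb{R}^{k \times n}$$
$$X' = \begin{bmatrix} x_1^T w_1 & \cdots & x_n^T w_1 \\ \vdots & & \vdots \\ x_1^T w_k & \cdots & x_n^T w_k \end{bmatrix} = W^T X$$
Reconstruction (the symbol $X'$ is reused here for the $d \times n$ matrix whose columns are the $x'_i$):
$$X' \in \mathbb{R}^{d \times n}, \qquad X' = W W^T X$$
6. Reconstruction error revisited (for k directions)
For a single direction $w$, the error for the data-point $x_i$ is $\lVert x_i - (x_i^T w) w \rVert^2$. For the top $k$ directions, the average error is:
$$\frac{1}{n}\sum_{i=1}^{n} \lVert x_i - x'_i \rVert^2 = \frac{1}{n}\sum_{i=1}^{n} \left\lVert x_i - \sum_{j=1}^{k} (x_i^T w_j) w_j \right\rVert^2$$
7. Variance captured
Total variance:
$$\lambda_1 + \cdots + \lambda_d$$
[Figure: scalar projections $x_1^T w, \cdots, x_4^T w$ of four data-points onto the direction $w$]
Variance along a given direction $w$ (unit vector):
$$\frac{1}{n}\sum_{i=1}^{n} (x_i^T w)^2 = w^T C w$$
Proportion of variance captured by top $k$ PCs:
$$\frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_d}$$
Heuristic to choose the value of $k$: the smallest value that captures at least 95% of the variance in the dataset.
8. Compression
Reconstruction: $nk + dk$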
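The 95% heuristic, the reconstruction and the storage count can be tied together in one short NumPy sketch. The synthetic data, the per-feature scales and all variable names are assumptions for illustration. The check that the reconstruction error equals the sum of the discarded eigenvalues is a standard identity rather than something stated in these notes.

```python
import numpy as np

# Hypothetical centered data: d = 5 features, n = 200 points stored as columns.
rng = np.random.default_rng(0)
scales = np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
X = rng.normal(size=(5, 200)) * scales
X = X - X.mean(axis=1, keepdims=True)
d, n = X.shape

# Eigen-pairs of the covariance matrix, sorted by decreasing eigenvalue.
C = (X @ X.T) / n
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
lambdas = eigenvalues[order]
W_all = eigenvectors[:, order]

# Total variance: the sum of the eigenvalues equals the trace of C.
print(np.isclose(lambdas.sum(), np.trace(C)))

# Proportion of variance captured by the top k PCs, for k = 1, ..., d.
captured = np.cumsum(lambdas) / np.sum(lambdas)

# Heuristic: smallest k whose captured proportion reaches 95%.
k = int(np.argmax(captured >= 0.95)) + 1

W = W_all[:, :k]    # d x k matrix of retained directions
coords = W.T @ X    # k x n scalar projections
X_rec = W @ coords  # d x n reconstruction

# Average squared reconstruction error over the dataset.
error = np.mean(np.sum((X - X_rec) ** 2, axis=0))

# Standard identity: the error equals the sum of the discarded eigenvalues.
print(k, np.isclose(error, lambdas[k:].sum()))

# Storage: nk + dk numbers for the compressed form versus nd for the raw data.
print(n * k + d * k, n * d)
```

Compression pays off only when nk + dk is smaller than nd, which holds when k is small relative to both n and d.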