Week-2
Contents
1. Issues with PCA
2. Addressing complexity ($XX^T$ and $X^TX$)
3. Addressing non-linearity (Feature Transformation)
4. Kernels
5. Kernel PCA
6. Kernel Centering

1. Issues with PCA
- Complexity: eigendecomposition of the $d \times d$ covariance matrix costs $O(d^3)$, which is a problem when $d \gg n$.
- Non-linearity: PCA assumes that the data lies in a linear subspace.

2. Addressing complexity ($XX^T$ and $X^TX$)
Covariance matrix:
$$C = \frac{1}{n} XX^T$$
Gram matrix $K = X^TX$, $K \in \mathbb{R}^{n \times n}$:
$$K = \begin{bmatrix} \text{--}\; x_1^T \;\text{--} \\ \vdots \\ \text{--}\; x_n^T \;\text{--} \end{bmatrix} \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$
$$K_{ij} = x_i^T x_j$$

Properties
- $XX^T$ and $X^TX$ are positive semi-definite (both have non-negative eigenvalues).
- $XX^T$ and $X^TX$ have the same non-zero eigenvalues.
- $\mathrm{rank}(X^TX) = \mathrm{rank}(XX^T) = \mathrm{rank}(X) = r$, with $\lambda_1 \geq \cdots \geq \lambda_r > 0$.

If $(\lambda_i, v_i)$ is an eigenpair of $K$ with $\|v_i\| = 1$, then $\left(\frac{\lambda_i}{n}, \frac{Xv_i}{\sqrt{\lambda_i}}\right)$ is an eigenpair of $C$, and
$$w_i = \frac{Xv_i}{\sqrt{\lambda_i}}$$
is the $i$-th PC of $C$. The complexity in this case is $O(n^3)$, since we decompose the $n \times n$ matrix $K$ rather than the $d \times d$ matrix $C$; this is the cheaper route when $n \ll d$. A numerical sketch of this trick follows.
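To make the eigenpair relationship concrete, here is a minimal numpy sketch (my own illustration, not part of the original notes) that recovers the top principal component of $C$ from the Gram matrix $K$ and checks it against the claim above:

```python
import numpy as np

# Toy data: d features, n points, with d >> n to motivate the Gram-matrix trick.
d, n = 100, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)  # center each feature across the n points

# Direct route: eigendecomposition of the d x d covariance matrix, O(d^3).
C = (X @ X.T) / n
eigvals_C, _ = np.linalg.eigh(C)

# Gram-matrix route: eigendecomposition of the n x n matrix K, O(n^3).
K = X.T @ X
eigvals_K, V = np.linalg.eigh(K)

# Largest eigenpair of K: (lambda, v) with ||v|| = 1.
lam, v = eigvals_K[-1], V[:, -1]

# Then (lambda / n, X v / sqrt(lambda)) should be an eigenpair of C.
w = X @ v / np.sqrt(lam)                    # first PC of C
assert np.isclose(eigvals_C[-1], lam / n)   # eigenvalues agree
assert np.allclose(C @ w, (lam / n) * w)    # w is an eigenvector of C
```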
3. Addressing non-linearity (Feature Transformation)
$$\phi: \mathbb{R}^d \to \mathbb{R}^D$$
Example of a polynomial transformation:
$$\phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$$
Transformed data-matrix $\phi(X) \in \mathbb{R}^{D \times n}$:
$$\phi(X) = \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix}$$
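As an illustration, a small numpy sketch (the function name `poly_features` is my own, not from the notes) of the explicit degree-2 map above, applied column-wise to a data matrix:

```python
import numpy as np

def poly_features(X):
    """Explicit degree-2 polynomial map for 2-D inputs.

    Maps each column [x1, x2] of X to [1, x1, x2, x1^2, x2^2, x1*x2],
    turning a (2, n) data matrix into the (6, n) transformed matrix phi(X).
    """
    x1, x2 = X[0], X[1]
    return np.vstack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

X = np.array([[1.0, 2.0, 0.5],
              [3.0, -1.0, 0.0]])  # d = 2, n = 3
print(poly_features(X).shape)     # (6, 3): D = 6 transformed features
```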
The transformed dataset might lie in a linear subspace of the transformed feature space, so PCA can be run on the transformed dataset in $\mathbb{R}^D$. But explicit transformations can be hard to compute: if many features are added, then $D \gg n$, which takes us back to issue 1 (complexity). Kernels help here.

4. Kernels
A kernel measures the similarity between data-points in the transformed space:
$$k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \qquad k(x, y) = \phi(x)^T \phi(y)$$
Polynomial kernel of degree $p$:
$$k(x, y) = (1 + x^T y)^p$$
The transformation corresponding to this kernel maps to a space $\mathbb{R}^D$ where
$$D = \binom{p + d}{d}$$
Gaussian kernel:
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
A sketch of the kernel trick follows.
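Returning to the polynomial kernel: to see that it really computes an inner product in the transformed space without forming $\phi$, here is a numpy sketch for $d = 2$, $p = 2$ (my own illustration; note this explicit map uses $\sqrt{2}$ scalings, unlike the unscaled example in section 3):

```python
import numpy as np

def poly_kernel(x, y, p=2):
    """Polynomial kernel k(x, y) = (1 + x^T y)^p."""
    return (1.0 + x @ y) ** p

def phi(x):
    """Explicit feature map whose inner product equals the degree-2
    polynomial kernel for d = 2 (note the sqrt(2) scalings)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])

# The kernel evaluates the D = 6 dimensional inner product
# without ever constructing the transformed vectors.
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```

Here $D = 6 = \binom{2+2}{2}$, consistent with the formula above.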
(Figure: 1-D plot of the Gaussian kernel for $x = 0$, $\sigma = 1$.)

Kernel matrix
For a dataset $D = \{x_1, \ldots, x_n\}$, the kernel matrix $K \in \mathbb{R}^{n \times n}$ is given by
$$K_{ij} = k(x_i, x_j)$$

Mercer's Theorem
A kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is valid if and only if:
- $k$ is symmetric, and
- for any set of data-points $\{x_1, \ldots, x_n\}$, the kernel matrix $K$ is symmetric and positive semi-definite.

5. Kernel PCA
Kernel-PCA($D$, $k$):
- Compute the kernel matrix $K$ using the kernel $k$.
- Let $(\lambda_i, v_i)$ be an eigenpair of $K$ with $\lambda_i > 0$ and $\|v_i\| = 1$. If $r$ is the rank of $K$, there are $r$ non-zero eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r > 0$.
- Form the following matrices:
$$D = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{\lambda_r}} \end{bmatrix} \in \mathbb{R}^{r \times r}, \qquad V = \begin{bmatrix} | & & | \\ v_1 & \cdots & v_r \\ | & & | \end{bmatrix} \in \mathbb{R}^{n \times r}$$
- The (scalar) projections of the data-points in the transformed space are then given by $X' \in \mathbb{R}^{r \times n}$:
$$X' = D V^T K$$
A sketch of the full procedure follows.
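Putting the steps together, a minimal numpy sketch of the Kernel-PCA routine above (function and variable names are my own; the centering of $K$ from section 6 is omitted here):

```python
import numpy as np

def kernel_pca(X, kernel, tol=1e-10):
    """Kernel PCA: returns the r x n matrix of scalar projections D V^T K.

    X      : (d, n) data matrix, one data point per column
    kernel : function k(x, y) on pairs of d-dimensional vectors
    """
    n = X.shape[1]
    # Kernel matrix K_ij = k(x_i, x_j)
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(n)]
                  for i in range(n)])

    # Eigenpairs of K; keep the r eigenvalues above tolerance, largest first.
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > tol
    lam = eigvals[keep][::-1]          # lambda_1 >= ... >= lambda_r > 0
    V = eigvecs[:, keep][:, ::-1]      # V in R^{n x r}

    D = np.diag(1.0 / np.sqrt(lam))    # D in R^{r x r}
    return D @ V.T @ K                 # projections, r x n

# Example with the degree-2 polynomial kernel
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
proj = kernel_pca(X, lambda x, y: (1.0 + x @ y) ** 2)
print(proj.shape)  # (r, 5)
```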
6. Kernel Centering
$$\phi: \mathbb{R}^d \to \mathbb{R}^D$$
PCA requires centered data, but in kernel PCA we never compute $\phi(x_i)$ explicitly, so the centering must be expressed in terms of $K$. Let
$$\mathbf{1}_{n \times n} = \frac{1}{n} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}$$
Centered transformed data-matrix:
$$\phi_c(X) = \phi(X) - \phi(X)\mathbf{1}_{n \times n}$$
Covariance matrix of the transformed dataset:
$$C = \frac{1}{n}\, \phi_c(X)\, \phi_c(X)^T$$
Let $k$ be the kernel corresponding to the transformation $\phi$, i.e. $k(x, y) = \phi(x)^T \phi(y)$, with kernel matrix $K = \phi(X)^T \phi(X)$. The centered kernel matrix is then
$$K^c = \phi_c(X)^T \phi_c(X) = K - K\mathbf{1}_{n \times n} - \mathbf{1}_{n \times n}K + \mathbf{1}_{n \times n}K\mathbf{1}_{n \times n}$$
We now replace $K$ with $K^c$ in the kernel-PCA algorithm. A sketch of this centering step follows.
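A minimal numpy sketch of the centering formula (my own illustration; it would be applied to $K$ before the eigendecomposition in the `kernel_pca` sketch above):

```python
import numpy as np

def center_kernel_matrix(K):
    """Centered kernel matrix K^c = K - K*1 - 1*K + 1*K*1,
    where 1 denotes the n x n matrix with every entry equal to 1/n."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    return K - K @ one - one @ K + one @ K @ one

# Quick check on a random PSD matrix: row and column means of K^c are ~0,
# as expected for the Gram matrix of centered feature vectors.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A.T @ A
Kc = center_kernel_matrix(K)
print(np.allclose(Kc.mean(axis=0), 0), np.allclose(Kc.mean(axis=1), 0))
```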