EAS

Principal component analysis

Introduction

Principal component analysis, or PCA for short, is a method mainly used to reduce the dimensionality of a data set containing many interrelated variables. While reducing the dimension, it is important to retain as much information as possible about the variation in the original data. This is achieved by computing the principal components (PCs) of the data set: when the PCs are ordered by decreasing variance, the first few capture most of the variation by construction.

Method Through Example

The method to find PCs uses the following operations:

  • Mean of a data set
  • Variance of a data set
  • Covariance of a data set
  • Covariance matrix
  • Eigenvalues of a matrix
  • Eigenvectors corresponding to the eigenvalues
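
For reference, the operations listed above can be written out as follows. This is a sketch under one assumption: the sums are divided by n (the population convention), since the text does not say whether n or n - 1 is used. The symbols x_i, y_i, n, C, v and \lambda are introduced here for illustration only.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
\mathrm{Var}(x) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\mathrm{Cov}(x, y) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
C v &= \lambda v, \qquad \det(C - \lambda I) = 0
\end{align*}
```

Here C is the covariance matrix, \lambda is an eigenvalue of C and v is an eigenvector corresponding to it.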

    Consider a data set 'A' with the marks of five students in three subjects.

    Input

    Now the mean score in each subject is given as:

    Input
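
The mean step can be sketched in a few lines of Python. The marks below are hypothetical values chosen for illustration (five students in rows, three subjects in columns) and are not taken from the original figure; `column_means` is likewise a name introduced here.

```python
# Hypothetical marks: five students (rows) in three subjects (columns).
# Illustrative values only, not the data from the original figure.
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
]


def column_means(data):
    """Return the mean of each column, i.e. one mean per subject."""
    n_rows = len(data)
    return [sum(column) / n_rows for column in zip(*data)]


means = column_means(marks)
print(means)  # [66.0, 60.0, 60.0]
```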

    A general 2x2 covariance matrix is shown below:
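    \[
    C = \begin{pmatrix}
    \mathrm{Var}(x) & \mathrm{Cov}(x, y) \\
    \mathrm{Cov}(y, x) & \mathrm{Var}(y)
    \end{pmatrix}
    \]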

    Similarly, a 3x3 covariance matrix is computed for the three subjects of the data set 'A' given above:

    Input

    The diagonal entries (shown in blue) are the variances and the off-diagonal entries are the covariances.
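
A minimal sketch of the covariance-matrix step is given below. It assumes the population convention (dividing by n); dividing by n - 1 would give the sample version instead. The marks are the same hypothetical values as before, and `covariance_matrix` is a name introduced for this illustration.

```python
# Hypothetical marks: five students (rows) in three subjects (columns).
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
]


def covariance_matrix(data):
    """Population covariance matrix of the columns (divides by n)."""
    n = len(data)
    columns = list(zip(*data))
    means = [sum(col) / n for col in columns]
    # Subtract each column's mean from its values.
    centered = [[value - mean for value in col] for col, mean in zip(columns, means)]
    # Entry (i, j) is the average product of centered columns i and j.
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
        for ci in centered
    ]


cov = covariance_matrix(marks)
for row in cov:
    print(row)
# [504.0, 360.0, 180.0]
# [360.0, 360.0, 0.0]
# [180.0, 0.0, 720.0]
```

The diagonal holds the variance of each subject, and the matrix is symmetric because Cov(x, y) = Cov(y, x).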

    Next, find the eigenvalues of the above covariance matrix. They come out to be:

    Input

    and the corresponding eigenvectors are:
    Input
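
One way to obtain the eigenvalues and eigenvectors numerically is sketched below. It uses power iteration with deflation, a technique swapped in here to keep the example free of dependencies; it is not necessarily how the values in the text were computed. The matrix is the population covariance matrix of the hypothetical marks used earlier, and all function names are introduced for illustration.

```python
import math

# Population covariance matrix of the hypothetical marks (illustrative values).
cov = [
    [504.0, 360.0, 180.0],
    [360.0, 360.0, 0.0],
    [180.0, 0.0, 720.0],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, iterations=1000):
    """Power iteration: eigenvalue of largest magnitude and its unit eigenvector."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v . (C v) for the unit vector v.
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


def eigenpairs(matrix):
    """Eigenpairs of a symmetric positive semi-definite matrix, largest first."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(size):
        value, vector = dominant_eigenpair(work)
        pairs.append((value, vector))
        # Deflation: remove the component just found so the next one dominates.
        work = [
            [work[i][j] - value * vector[i] * vector[j] for j in range(size)]
            for i in range(size)
        ]
    return pairs


pairs = eigenpairs(cov)
for value, vector in pairs:
    print(round(value, 2), [round(x, 3) for x in vector])
```

The sketch assumes distinct, non-negative eigenvalues, which holds for this matrix. In practice a library routine for symmetric matrices, such as `numpy.linalg.eigh`, would be the more usual choice.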

    The eigenvalues are sorted in descending order and the eigenvectors corresponding to the first 'N' of them are chosen to form the PCA graph, whose dimension is less than or equal to that of the initial data set. Here 'N' is the dimension of the PCA graph, and its elements are obtained through the formula:
    Input

    where W' is the transpose of the NxM matrix W, 'N' is the dimension of the reduced PCA graph and 'M' is the dimension of the initial data set.

    Continuing the above example, taking the two largest eigenvalues and their corresponding eigenvectors gives a 2x3 matrix:
    Input

    Thus the PCA graph in this case is two-dimensional, whereas the initial data set is three-dimensional, yet it still retains most of the information about the variation in the data points.
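
The projection step can be sketched as follows, under several stated assumptions. With W of size NxM, one reading consistent with the dimensions given above is y = x W' for each data row x of length M, which yields a row y of length N. Subtracting the column means first is a common convention and an assumption here, since the text does not say whether the data is centered. The matrix `W` below is a hypothetical 2x3 matrix with orthonormal rows that stands in for the two leading eigenvectors; it is not computed from the data, and the marks are the same illustrative values as before.

```python
import math

# Hypothetical marks: five students (rows) in three subjects (columns).
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
]

# Hypothetical 2x3 matrix W (N = 2, M = 3) with orthonormal rows.
# In PCA its rows would be the eigenvectors of the two largest eigenvalues.
s = 1 / math.sqrt(2)
W = [
    [s, s, 0.0],
    [0.0, 0.0, 1.0],
]


def project(data, w):
    """Center each column, then compute y = x W' for every row x."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    # Each output coordinate is the dot product of a centered row with a row of W.
    return [
        [sum(x * c for x, c in zip(row, direction)) for direction in w]
        for row in centered
    ]


reduced = project(marks, W)
for row in reduced:
    print([round(y, 3) for y in row])
```

Each of the five three-dimensional rows becomes a two-dimensional point. Replacing the stand-in `W` with the actual leading eigenvectors would give the PCA coordinates.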

    For an intuitive understanding, see the second reference.

    References

    1. Akash Dubey, The Mathematics Behind Principal Component Analysis (source of the example)
    2. StatQuest with Josh Starmer (for visualization)
    3. CS460, lecture by Dr. Subhankar Mishra, NISER