Implementing PCA from Scratch

Principal Component Analysis in Python and NumPy

March 26, 2024 - Hammad Munir - 7 min read
#PCA · #Dimensionality Reduction · #Machine Learning · #NumPy

Implementing PCA from Scratch: A Deep Dive into Dimensionality Reduction

Principal Component Analysis (PCA) is one of the most fundamental techniques in machine learning and data science. While libraries like scikit-learn offer ready-made implementations, building PCA from scratch provides valuable insight into the underlying mathematics and algorithms.

Understanding PCA

PCA is a dimensionality reduction technique that transforms data into a new coordinate system where the greatest variance lies along the first axis (first principal component), the second greatest variance along the second axis, and so on.

Key Concepts:

  • **Principal Components**: Orthogonal directions of maximum variance
  • **Eigenvalues**: Measure of variance explained by each component
  • **Eigenvectors**: Directions of the principal components
  • **Explained Variance Ratio**: Proportion of total variance explained by each component

Mathematical Foundation

Step 1: Data Standardization

First, we standardize the data to have zero mean and unit variance:

X_standardized = (X - mean(X)) / std(X)
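
As a minimal NumPy sketch of this step (the toy matrix below is hypothetical; X is assumed to have shape (n_samples, n_features)):

import numpy as np

# Hypothetical toy data: 5 samples, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.2],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

# Column-wise mean and standard deviation
X_mean = X.mean(axis=0)
X_std_dev = X.std(axis=0)

# Zero mean, unit variance per feature
X_standardized = (X - X_mean) / X_std_dev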

Step 2: Covariance Matrix

Compute the covariance matrix of the standardized data:

C = (1/(n-1)) * X^T * X
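
Continuing the sketch above, the covariance matrix can be computed from the formula directly or with np.cov, which expects features in rows by default (hence the transpose); with the mean already removed, both give essentially the same result:

n_samples = X_standardized.shape[0]

# Direct computation from the formula
C = (X_standardized.T @ X_standardized) / (n_samples - 1)

# Equivalent computation via NumPy's helper
C_np = np.cov(X_standardized.T)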

Step 3: Eigendecomposition

Find the eigenvalues and eigenvectors of the covariance matrix:

C * v = λ * v
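
In NumPy, np.linalg.eigh is the natural choice for this step because a covariance matrix is symmetric; it returns real eigenvalues in ascending order. Reusing C from the sketch above:

# Each column of `eigenvectors` is one eigenvector v satisfying C @ v = λ * v
eigenvalues, eigenvectors = np.linalg.eigh(C)

print(eigenvalues)           # variance captured along each eigen-direction
print(eigenvectors[:, -1])   # eigenvector paired with the largest eigenvalue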

Step 4: Principal Components

Sort eigenvectors by eigenvalues in descending order to get principal components.
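
A short sketch of the sorting step, reusing the eigendecomposition above:

# Indices that order the eigenvalues from largest to smallest
idx = np.argsort(eigenvalues)[::-1]

eigenvalues_sorted = eigenvalues[idx]
components = eigenvectors[:, idx]      # principal components, one per column

# Proportion of total variance captured by each component
explained_variance_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()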

Implementation in Python

import numpy as np

class PCA:
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None
        self.std_ = None
        
    def fit(self, X):
        # Standardize the data and keep the statistics for later transforms
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        X_std = (X - self.mean_) / self.std_
        
        # Compute covariance matrix (np.cov expects features in rows, hence the transpose)
        cov_matrix = np.cov(X_std.T)
        
        # Eigendecomposition; eigh is used because the covariance matrix is
        # symmetric, so it returns real eigenvalues and orthonormal eigenvectors
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Sort by eigenvalues in descending order
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        # Select number of components
        if self.n_components is None:
            self.n_components = len(eigenvalues)
        
        self.components_ = eigenvectors[:, :self.n_components]
        self.explained_variance_ratio_ = eigenvalues[:self.n_components] / np.sum(eigenvalues)
        
        return self
    
    def transform(self, X):
        # Reuse the mean and standard deviation learned in fit, rather than
        # recomputing them from the new data
        X_std = (X - self.mean_) / self.std_
        return X_std @ self.components_
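
A quick usage example of the class above, on hypothetical random data. The optional sanity check assumes scikit-learn is installed; because scikit-learn's PCA only centers the data, it is paired with StandardScaler so that the explained variance ratios are comparable:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # hypothetical data: 200 samples, 5 features

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)               # shape (200, 2)
print(pca.explained_variance_ratio_)

# Optional sanity check against scikit-learn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as SkPCA

X_scaled = StandardScaler().fit_transform(X)
print(SkPCA(n_components=2).fit(X_scaled).explained_variance_ratio_)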

Applications and Use Cases

Data Visualization:

PCA is commonly used to reduce high-dimensional data to 2D or 3D for visualization purposes.
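
A common pattern, sketched here with scikit-learn's built-in Iris dataset and matplotlib (both assumed to be available):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 4 features per sample
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()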

Feature Engineering:

Creating new features that capture the most important information in the data.
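
For example, PCA components can feed a downstream model as engineered features; a hedged sketch using a scikit-learn pipeline (the classifier and component count are arbitrary illustration choices):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))                   # training accuracy on the PCA features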

Noise Reduction:

Removing components with low variance can help reduce noise in the data.

Data Compression:

Reducing the dimensionality while retaining most of the information.
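
Noise reduction and compression follow the same mechanics: keep only the top components, project, and (optionally) reconstruct. A minimal sketch on hypothetical low-rank data, using scikit-learn's inverse_transform for the reconstruction:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with 5 latent factors observed through 50 noisy features
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(1000, 50))

pca = PCA(n_components=5).fit(X)
X_compressed = pca.transform(X)            # shape (1000, 5): 10x fewer values to store
X_denoised = pca.inverse_transform(X_compressed)   # back to shape (1000, 50), noise reduced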

Advanced Considerations

Kernel PCA:

For non-linear structure, kernel PCA implicitly maps the data into a higher-dimensional feature space via a kernel function and performs linear PCA in that space.
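
A hedged sketch with scikit-learn's KernelPCA; the RBF kernel and gamma value are arbitrary choices for illustration:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear projection can separate them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)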

Incremental PCA:

For large datasets that don't fit in memory, incremental PCA processes data in batches.
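
A sketch with scikit-learn's IncrementalPCA, feeding hypothetical batches through partial_fit:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)

# Stand-in for a loop over chunks of a dataset that does not fit in memory
rng = np.random.default_rng(2)
for _ in range(20):
    batch = rng.normal(size=(500, 50))     # one batch of 500 samples, 50 features
    ipca.partial_fit(batch)

X_new = rng.normal(size=(5, 50))
print(ipca.transform(X_new).shape)         # (5, 10)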

Sparse PCA:

When you want principal components with sparse loadings, sparse PCA can be more interpretable.
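
A sketch with scikit-learn's SparsePCA; the alpha value controlling sparsity is an arbitrary illustration choice:

import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 30))             # hypothetical data

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_sparse = spca.fit_transform(X)

# Many loadings are exactly zero, so each component involves only a few features
print(np.mean(spca.components_ == 0))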

Performance Optimization

Computational Complexity:

The eigendecomposition step is O(d³) in the number of features d, plus O(n·d²) to build the covariance matrix, which can be expensive for high-dimensional datasets.

Memory Usage:

For very large datasets, consider using randomized PCA or incremental approaches.
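
For instance, scikit-learn's PCA exposes a randomized solver that computes an approximate truncated decomposition much more cheaply when only a few components are needed; a brief sketch on hypothetical wide data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 500))           # hypothetical wide dataset

pca = PCA(n_components=20, svd_solver="randomized", random_state=0).fit(X)
print(pca.explained_variance_ratio_.sum())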

Numerical Stability:

Computing the SVD of the centered data matrix avoids forming the covariance matrix explicitly and is generally more numerically stable; this is the approach scikit-learn's PCA takes internally.
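
A sketch of the SVD route (here the data is only centered, not scaled): the rows of Vt are the principal components, and the singular values relate to the eigenvalues via λ = s² / (n − 1).

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))              # hypothetical data
X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data; the covariance matrix is never formed
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

X_reduced = X_centered @ Vt[:2].T          # project onto the first two components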

Validation and Testing

Reconstruction Error:

Measure how well the original data can be reconstructed from the reduced representation.
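
A minimal sketch of this check, using scikit-learn's inverse_transform on hypothetical low-rank data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

pca = PCA(n_components=5).fit(X)
X_restored = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error over all entries
print(np.mean((X - X_restored) ** 2))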

Cross-Validation:

Use cross-validation to determine the optimal number of components.
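
One simple protocol is to score each candidate number of components by its reconstruction error on held-out folds; a sketch of that idea on hypothetical data (other criteria, such as downstream model accuracy, slot in the same way):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5)) @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(400, 20))

for k in range(1, 11):
    errors = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        pca = PCA(n_components=k).fit(X[train_idx])
        X_restored = pca.inverse_transform(pca.transform(X[test_idx]))
        errors.append(np.mean((X[test_idx] - X_restored) ** 2))
    print(k, np.mean(errors))              # pick the k where the error levels off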

Explained Variance:

Monitor the cumulative explained variance to ensure you're retaining sufficient information.
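
A short sketch: fit with all components, take the cumulative sum of the explained variance ratios, and keep the smallest number of components that crosses a chosen threshold (95% here, as an arbitrary example):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 15)) + 0.1 * rng.normal(size=(300, 15))

cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k_95 = int(np.searchsorted(cumulative, 0.95)) + 1   # first k reaching 95% of the variance
print(cumulative, k_95)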

Future Directions

PCA sits within a broader family of dimensionality reduction techniques that continues to grow, including:

  • Non-negative Matrix Factorization (NMF)
  • Independent Component Analysis (ICA)
  • t-SNE and UMAP for non-linear dimensionality reduction
  • Autoencoders for deep learning-based dimensionality reduction

Understanding PCA from scratch provides a solid foundation for these more advanced techniques and helps develop intuition about dimensionality reduction in general.
