Classification with PCA, PLS, QDA, Logistic Regression, and Random Forest: A Tutorial

Introduction

This tutorial works through a challenging classification problem with high-dimensional data. Inspired by a typical assignment in statistical learning (like MAST90138), we'll explore how to handle datasets with many predictors (p = 365) relative to sample size (n = 150). This is a common scenario in modern data science, from genomics to finance. We'll use rainfall data from Australia, where the goal is to classify locations as North (G=0) or South (G=1) based on 365 daily rainfall measurements. The tutorial covers the key steps: understanding why standard classifiers fail, using dimensionality reduction (PCA and PLS), tuning via cross-validation, and comparing with random forests. By the end, you'll have a reusable workflow for similar high-dimensional classification tasks.

Why Standard Classifiers Fail on High-Dimensional Data

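When p > n, traditional methods like Quadratic Discriminant Analysis (QDA) and logistic regression break down. QDA estimates a separate p x p covariance matrix for each class and must invert it. A class with n_k observations gives a sample covariance matrix of rank at most n_k - 1, so with p = 365 and only 150 observations in total, every class covariance matrix is singular and cannot be inverted. Logistic regression with all predictors has more coefficients than observations, so the maximum likelihood estimate is not unique; the training classes can generally be separated perfectly, and the fit overfits. In our example, using all 365 predictors leads to numerical instability. Always check the output: for logistic regression, summary() will show NA for coefficients that cannot be estimated and huge standard errors for the rest, usually alongside convergence warnings. The lesson: never blindly apply these methods to high-dimensional data without regularization or dimensionality reduction.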
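
The failure is easy to reproduce. The sketch below is a minimal illustration, not the assignment's own code: it assumes simulated Gaussian noise as a stand-in for the rainfall data (the names X, G and dat are made up for the example) and uses only base R plus MASS, which ships with R.

```r
# Simulated stand-in for the rainfall data (assumption: not the real data set)
set.seed(1)
n <- 150
p <- 365
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
G <- rep(c(0, 1), each = n / 2)

# The class covariance matrix is p x p, but its rank cannot exceed n_k - 1 = 74
S0 <- cov(X[G == 0, ])
rank_S0 <- qr(S0)$rank
rank_S0

# QDA needs to invert each class covariance, so MASS::qda() stops with an error
qda_fit <- try(MASS::qda(X, grouping = factor(G)), silent = TRUE)
qda_failed <- inherits(qda_fit, "try-error")
qda_failed

# Logistic regression: no more than n coefficients can be estimated,
# so at least p + 1 - n = 216 of the 366 coefficients come back as NA
dat <- data.frame(G = G, X)
logit_fit <- suppressWarnings(glm(G ~ ., data = dat, family = binomial))
n_na <- sum(is.na(coef(logit_fit)))
n_na
```

The warnings are suppressed here only to keep the output short; in real work, read them, since they are part of the diagnosis.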

Dimensionality Reduction with PCA and PLS

Principal Component Analysis (PCA) finds directions of maximum variance in X, ignoring the response. Partial Least Squares (PLS) finds directions that maximize covariance between X and the response Y (here, the class indicator coded 0/1). Both produce components whose scores are mutually uncorrelated and can be used as new predictors. In R, use prcomp() for PCA and plsr() from the pls package for PLS. Both functions center the data by default; scaling is optional (scale. = TRUE in prcomp(), scale = TRUE in plsr()) and matters mainly when predictors are measured on different scales. The components are linear combinations of the original predictors; you can verify this by multiplying the centered data by the rotation matrix from prcomp() and comparing the result with the scores it returns.
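
A short sketch of both decompositions, including the manual reconstruction check, might look like this. It assumes the pls package is installed and again uses simulated data with an artificial class shift in the first 30 columns; the variable names are illustrative only.

```r
library(pls)  # assumed to be installed; provides plsr() and scores()

# Simulated stand-in data with an artificial class difference (assumption)
set.seed(1)
n <- 150
p <- 365
X <- matrix(rnorm(n * p), n, p)
G <- rep(c(0, 1), each = n / 2)
X[G == 1, 1:30] <- X[G == 1, 1:30] + 1

# PCA: prcomp() centers by default; scaling is off unless scale. = TRUE
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Manual reconstruction: centered data times the rotation matrix gives the scores
Xc <- sweep(X, 2, pca$center)
manual_pca <- Xc %*% pca$rotation
max_diff_pca <- max(abs(manual_pca - pca$x))

# PLS: the 0/1 class indicator is the response; plsr() also centers by default
pls_fit <- plsr(G ~ X, ncomp = 10)
pls_scores <- unclass(scores(pls_fit))   # n x 10 matrix of PLS scores

# The projection matrix stored in the fit maps centered X to the PLS scores
manual_pls <- Xc %*% pls_fit$projection
max_diff_pls <- max(abs(manual_pls - pls_scores))

c(max_diff_pca, max_diff_pls)   # both should be numerically zero
```

Note that prcomp() returns as many components as the data allow (here 150), while plsr() computes only the ncomp you request.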

Choosing the Number of Components via Cross-Validation

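We need to decide how many components to retain. Use Leave-One-Out Cross-Validation (LOOCV) to evaluate classification error for each number of components (up to 50). For each fold, compute the components on the training observations only, train the classifier (QDA or logistic) on the scores for the first k components, then project the left-out observation with the same centering and loadings and predict its class. Recomputing the components inside each fold matters most for PLS, which uses the response: if PLS is fitted on all the data first, the left-out label leaks into the components and the error estimate becomes optimistic. Keep in mind that QDA needs more observations in each class than components, which bounds the usable k. Plot the LOOCV error vs. number of components to find the minimum. This guards against overfitting, because each k is judged on observations that were not used to fit the model, and it tends to select a parsimonious model. In our rainfall example, you'll likely find that a small number of components (e.g., 5-15) yields the best performance.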
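
The loop described above can be sketched as follows for PCA followed by QDA. This is one possible arrangement under stated assumptions: the data are simulated, the PCA is recomputed inside every fold, and kmax is set to 10 (rather than 50) only to keep the run short. The logistic and PLS variants follow the same pattern.

```r
library(MASS)  # provides qda()

# Simulated stand-in data with an artificial class difference (assumption)
set.seed(1)
n <- 150
p <- 365
X <- matrix(rnorm(n * p), n, p)
G <- factor(rep(c(0, 1), each = n / 2))
X[G == 1, 1:30] <- X[G == 1, 1:30] + 1

kmax <- 10                                # illustrative; the text goes up to 50
wrong <- matrix(FALSE, nrow = n, ncol = kmax)

for (i in seq_len(n)) {
  # Fit PCA on the n - 1 training rows only
  pca_i <- prcomp(X[-i, ], center = TRUE)
  Z_train <- pca_i$x[, 1:kmax, drop = FALSE]
  # predict() applies the training centering and loadings to the held-out row
  Z_test <- predict(pca_i, newdata = X[i, , drop = FALSE])[, 1:kmax, drop = FALSE]

  for (k in seq_len(kmax)) {
    fit <- qda(Z_train[, 1:k, drop = FALSE], grouping = G[-i])
    pred <- predict(fit, Z_test[, 1:k, drop = FALSE])$class
    wrong[i, k] <- as.character(pred) != as.character(G[i])
  }
}

cv_error <- colMeans(wrong)               # LOOCV error for k = 1, ..., kmax
best_k <- which.min(cv_error)             # first k attaining the minimum

plot(seq_len(kmax), cv_error, type = "b",
     xlab = "Number of components", ylab = "LOOCV error")
```

Because which.min() returns the first minimum, ties are resolved in favor of the smaller, more parsimonious k.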

Training Classifiers on Reduced Data

Once you've chosen the optimal number of components (say, k=10 for PCA and k=8 for PLS), train your QDA and logistic classifiers on the training set projected onto those components. Apply the same centering (using training means) and the same loadings to the test set before projecting; never recompute either from the test data. Then compute the test error and compare PCA vs. PLS for each classifier. PLS may outperform PCA because it uses the response when building its components, but this depends on the data.
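
As a rough sketch of this step, the code below fits all four combinations on a simulated training set and scores them on a simulated test set. The data generator make_data(), the test-set size of 50, and the helper test_error() are assumptions made for the example; k = 10 and k = 8 are the illustrative values from the text. It assumes the pls package is installed.

```r
library(MASS)
library(pls)   # assumed to be installed

# Hypothetical data generator standing in for the rainfall data
make_data <- function(n, p = 365) {
  G <- rep(c(0, 1), length.out = n)
  X <- matrix(rnorm(n * p), n, p)
  X[G == 1, 1:30] <- X[G == 1, 1:30] + 0.5
  list(X = X, G = G)
}
set.seed(1)
train <- make_data(150)
test <- make_data(50)
G_train <- train$G
G_test <- test$G
k_pca <- 10
k_pls <- 8

# Center both sets with the TRAINING means
mu <- colMeans(train$X)
Xc_train <- sweep(train$X, 2, mu)
Xc_test <- sweep(test$X, 2, mu)

# Loadings come from the training set only
V_pca <- prcomp(Xc_train, center = FALSE)$rotation[, 1:k_pca]
V_pls <- plsr(G_train ~ Xc_train, ncomp = k_pls)$projection

# Helper (hypothetical): test error in percent for QDA and logistic regression
test_error <- function(V) {
  Z_train <- Xc_train %*% V
  Z_test <- Xc_test %*% V
  qda_fit <- qda(Z_train, grouping = factor(G_train))
  qda_pred <- as.character(predict(qda_fit, Z_test)$class)
  # glm() may warn when the scores separate the training classes perfectly
  logit_fit <- suppressWarnings(
    glm(G ~ ., data = data.frame(G = G_train, Z_train), family = binomial)
  )
  prob <- predict(logit_fit, newdata = data.frame(Z_test), type = "response")
  logit_pred <- as.character(as.integer(prob > 0.5))
  c(QDA = 100 * mean(qda_pred != as.character(G_test)),
    Logistic = 100 * mean(logit_pred != as.character(G_test)))
}

errors <- rbind(PCA = test_error(V_pca), PLS = test_error(V_pls))
errors
```

The numbers this prints describe the simulated data only and say nothing about how the methods rank on the rainfall data.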

Random Forest as a Non-Linear Alternative

Random forests handle high-dimensional data naturally by considering a random subset of predictors at each split. Use the randomForest package with the default mtry, which for classification is floor(sqrt(p)), i.e. 19 when p = 365. Choose the number of trees B by monitoring out-of-bag (OOB) error; plot OOB error vs. number of trees and pick a value where it stabilizes (e.g., 500). Variable importance plots (mean decrease in accuracy and mean decrease in Gini impurity) can reveal which days (predictors) are most discriminative between North and South. In Australia, rainfall patterns differ seasonally (the tropical north is wettest in the summer months, while much of the south receives more of its rain in winter), so important predictors likely correspond to months with contrasting rainfall between regions.
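
A minimal sketch of this workflow, assuming the randomForest package is installed and using simulated data with made-up column names day1 to day365:

```r
library(randomForest)  # assumed to be installed

# Simulated stand-in data with an artificial class difference (assumption)
set.seed(1)
n <- 150
p <- 365
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("day", seq_len(p))
G <- factor(rep(c(0, 1), each = n / 2))
X[G == 1, 1:30] <- X[G == 1, 1:30] + 1

# y must be a factor for classification; mtry is left at its default
set.seed(2)
rf <- randomForest(x = X, y = G, ntree = 500, importance = TRUE)
rf$mtry                                   # floor(sqrt(365)) = 19

# OOB error after 1, 2, ..., 500 trees: look for where the curve flattens
oob <- rf$err.rate[, "OOB"]
plot(seq_along(oob), oob, type = "l",
     xlab = "Number of trees", ylab = "OOB error")

# Importance measures; the ten highest-ranked days by mean decrease in accuracy
imp <- importance(rf)
top_days <- order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)[1:10]
rownames(imp)[top_days]

varImpPlot(rf)                            # plots both importance measures
```

importance = TRUE is needed for the permutation-based mean decrease in accuracy; without it only the Gini measure is computed.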

Comparing All Classifiers

Finally, compare the test errors of all five classifiers: Logistic+PCA, Logistic+PLS, QDA+PCA, QDA+PLS, and RF. RF may perform best if the class boundary involves interactions or non-linearities, and QDA+PLS may also do well, but the ranking depends on the data, so let the test errors decide. The worst performers are likely the classifiers using all predictors (if attempted) or those with poorly chosen components. Always report test error as the percentage of test samples that are misclassified.
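
Reporting can be kept uniform with a small helper. The function name error_pct and the toy label vectors below are made up purely to exercise the function; they are not results from the rainfall data.

```r
# Hypothetical helper: percentage of misclassified samples
error_pct <- function(pred, truth) {
  100 * mean(as.character(pred) != as.character(truth))
}

# Toy labels and predictions (invented for illustration only)
truth  <- c(0, 0, 1, 1, 1, 0, 1, 0)
pred_a <- c(0, 0, 1, 1, 0, 0, 1, 0)   # 1 of 8 wrong
pred_b <- c(0, 1, 1, 1, 0, 0, 1, 1)   # 3 of 8 wrong

errors <- sapply(list(classifier_a = pred_a, classifier_b = pred_b),
                 error_pct, truth = truth)
sort(errors)   # classifier_a: 12.5, classifier_b: 37.5
```

Converting to character before comparing lets the same helper accept factor predictions from QDA or random forest and numeric 0/1 predictions from logistic regression.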

Practical Tips for Your Assignment

Data centering: Use the training set means to center both training and test sets before PCA/PLS projection.
Cross-validation: LOOCV is computationally expensive but feasible for n=150. For larger datasets, use k-fold CV.
Reproducibility: Call set.seed() before fitting the random forest, and before any random fold assignment, to ensure consistent results.
Code organization: Use clear comments and avoid superfluous code. Markers appreciate concise, well-structured R scripts.
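
For the k-fold alternative mentioned in the tips, a sketch along these lines may be a useful starting point. It assumes simulated data, K = 10 folds, and a fixed k = 5 principal components fed to QDA; all of these are illustrative choices rather than recommendations.

```r
library(MASS)

# Simulated stand-in data with an artificial class difference (assumption)
set.seed(2024)                            # fixes both the data and the folds
n <- 150
p <- 365
X <- matrix(rnorm(n * p), n, p)
G <- factor(rep(c(0, 1), each = n / 2))
X[G == 1, 1:30] <- X[G == 1, 1:30] + 1

K <- 10                                   # number of folds
k <- 5                                    # number of components (illustrative)
fold <- sample(rep(seq_len(K), length.out = n))   # random, equal-sized folds

wrong <- logical(n)
for (f in seq_len(K)) {
  test_idx <- which(fold == f)

  # Center with the training-fold means, then get loadings from training rows
  mu <- colMeans(X[-test_idx, ])
  Xc_train <- sweep(X[-test_idx, ], 2, mu)
  Xc_test <- sweep(X[test_idx, , drop = FALSE], 2, mu)
  V <- prcomp(Xc_train, center = FALSE)$rotation[, 1:k, drop = FALSE]

  fit <- qda(Xc_train %*% V, grouping = G[-test_idx])
  pred <- predict(fit, Xc_test %*% V)$class
  wrong[test_idx] <- as.character(pred) != as.character(G[test_idx])
}

cv_error_pct <- 100 * mean(wrong)         # 10-fold CV error in percent
cv_error_pct
```

With K = 10 the PCA is refitted 10 times instead of 150, which is where the saving over LOOCV comes from on larger data sets.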

Conclusion

High-dimensional classification is a common challenge, but with dimensionality reduction and careful validation, you can build effective models. This tutorial covered the pipeline: diagnose why full models fail, reduce dimension via PCA/PLS, tune component count via LOOCV, train classifiers, and compare with random forests. Apply these techniques to your own datasets, whether in climate science, bioinformatics, or marketing analytics. Remember to always validate your choices and interpret results in context.