High-Dimensional Inference Tutorial: PCA, PPCA, Autoencoders

Introduction to High-Dimensional Inference

High-dimensional inference is a cornerstone of modern statistics and machine learning, especially when dealing with data where the number of features (p or q) is large relative to the sample size (n). This tutorial walks you through key techniques—principal component analysis (PCA), probabilistic PCA (PPCA), and autoencoders—using the classic zip.txt dataset of handwritten digits (16x16 images, q=256, n=7291). These methods are essential for dimensionality reduction, visualization, and uncovering latent structure. Whether you're a STAT3006 student or a data enthusiast, mastering these tools will boost your ability to handle complex, high-dimensional data.

Understanding the Dataset: Handwritten Digits

The zip.txt dataset contains 7291 observations, each a 256-dimensional vector representing a 16x16 grayscale image of a digit (0-9). The label Yi indicates the true digit. In Part 1, you select one digit (e.g., y=3) and plot 9 unique images. This step is crucial for data familiarization. Use the following R code to load and visualize:

zip_data <- read.table("zip.txt")
labels <- zip_data[, 1]
images <- as.matrix(zip_data[, -1])
digit <- 3
indices <- which(labels == digit)[1:9]
par(mfrow = c(3, 3))
for (i in indices) {
  img <- matrix(images[i, ], nrow = 16, byrow = TRUE)
  image(img, col = grey(seq(1, 0, length = 256)), axes = FALSE)
}

This visualization gives you an intuitive sense of the data before applying any inference methods.

Principal Component Analysis (PCA) for Dimensionality Reduction

PCA is a fundamental technique for high-dimensional inference. In Part 2, you center the data (subtract the mean) and find the best rank-s approximation (s=4) by minimizing reconstruction error. The solution involves the singular value decomposition (SVD) of the centered data matrix. Here's how to compute it in R:

X_centered <- scale(images, center = TRUE, scale = FALSE)
svd_result <- svd(X_centered)
s <- 4
F_hat <- t(svd_result$v[, 1:s])  # s x q
R_hat <- svd_result$v[, 1:s]      # q x s
reconstructions <- X_centered %*% t(F_hat) %*% t(R_hat)
min_error <- sum((X_centered - reconstructions)^2)
cat("Minimum reconstruction error:", min_error)

This error quantifies how much information is lost by projecting onto 4 principal components. The forward mappings Wi = F_hat * X_centered (Part 3) are 4-dimensional representations, which you can plot colored by digit label. Use the pairs function to visualize pairwise scatterplots:

W <- t(F_hat %*% t(X_centered))
pairs(W[, 1:4], col = labels, pch = 19, cex = 0.5)

You'll likely see clusters forming, indicating that PCA captures some digit structure even in 4D.

Spectral Decomposition and Variance Explained

Part 4 asks for the proportion of total variance explained by the first 4 eigenvectors. Compute the Gram matrix and its eigenvalues:

Gram <- t(X_centered) %*% X_centered
eigen_vals <- eigen(Gram, symmetric = TRUE)$values
total_var <- sum(eigen_vals)
var_explained <- sum(eigen_vals[1:s]) / total_var
cat("Proportion of variance explained by 4 PCs:", var_explained)

Typically, the first few PCs capture a modest fraction (maybe 30-40%) because digits are complex. This highlights the challenge of high-dimensional inference: even with many components, variance is spread out.

Probabilistic PCA (PPCA) with EM Algorithm

PPCA extends PCA by modeling the latent variables as Gaussian and adding noise. In Part 5, you maximize the log-likelihood using an EM algorithm. The parameters are sigma^2, mu, and R (q x s). Here's a simplified EM implementation in R:

n <- nrow(images)
q <- ncol(images)
s <- 4
X <- images
mu <- colMeans(X)
Xc <- scale(X, center = TRUE, scale = FALSE)
# Initialize R randomly
R <- matrix(rnorm(q*s), nrow = q, ncol = s)
sigma2 <- 1
for (iter in 1:100) {
  M <- t(R) %*% R + sigma2 * diag(s)
  invM <- solve(M)
  # E-step: expected latent variables
  Ez <- Xc %*% R %*% invM
  Ezz <- sigma2 * invM + t(Ez) %*% Ez / n
  # M-step
  R_new <- (t(Xc) %*% Ez) %*% solve(Ezz)
  sigma2_new <- sum(diag(t(Xc) %*% Xc - 2 * t(Xc) %*% Ez %*% t(R_new) + n * R_new %*% Ezz %*% t(R_new))) / (n * q)
  R <- R_new
  sigma2 <- sigma2_new
}

Plot the posterior expectations (Part 6) similarly to PCA forward mappings. Compare the two plots: PPCA often gives smoother clusters due to the noise model.

Autoencoders for Nonlinear Dimensionality Reduction

Part 7 introduces a 3-layer autoencoder with a chosen activation function (e.g., ReLU or sigmoid). This is a nonlinear alternative to PCA. In Python using TensorFlow/Keras:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
# Load data (assuming X is numpy array)
input_dim = 256
encoding_dim = 4
input_layer = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation='relu')(input_layer)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = models.Model(input_layer, decoded)
encoder = models.Model(input_layer, encoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True, verbose=0)
W = encoder.predict(X)

Plot the 4D encodings colored by labels. You'll likely see more distinct clusters than PCA, as autoencoders capture nonlinear patterns.

Connecting to Current Trends: AI and Image Generation

High-dimensional inference is at the heart of modern AI. For example, diffusion models (like DALL-E 2 and Stable Diffusion) learn to generate images by reversing a noise process—a sophisticated form of high-dimensional inference. The PCA and autoencoder techniques you just practiced are foundational for understanding these state-of-the-art models. Similarly, in genomics (Problem 2), MMD tests help compare distributions of gene expressions between cancer subtypes, a critical task in precision medicine.

Conclusion

This tutorial covered PCA, PPCA, and autoencoders for high-dimensional inference using handwritten digits. These methods are not just academic; they power real-world applications from facial recognition to medical diagnosis. Practice with the zip.txt dataset and extend to the Golub leukemia data (Problem 2) to solidify your understanding. Remember: in high dimensions, always check assumptions, use appropriate regularization, and validate with domain knowledge.