High-Dimensional Inference in STAT3006: A Tutorial on PCA, Factor Models, and Kernel Methods

Learn key concepts from STAT3006 Assignment 4 on high-dimensional inference, including PCA, factor analysis, autoencoders, and kernel two-sample tests, with practical examples using zip and Golub data sets.

Introduction to High-Dimensional Inference

High-dimensional inference is a cornerstone of modern statistics and machine learning, especially when the number of features p is large relative to the sample size n. This tutorial covers core methods from STAT3006 Assignment 4: dimensionality reduction via PCA, factor analysis, and autoencoders, followed by nonparametric two-sample testing with the maximum mean discrepancy (MMD). These techniques are widely used in genomics, image analysis, and AI applications like facial recognition and self-driving cars. For instance, just as a self-driving car must compress high-dimensional sensor data to make real-time decisions, you'll learn to reduce 256-pixel images of handwritten digits to 4 latent dimensions.

Problem 1: Dimensionality Reduction on Zip Data

Part 1: Visualizing Digit Images

Select one digit (e.g., y = 3) and plot m = 9 unique 16×16 images from the zip.txt data set. Use matplotlib in Python to display them in a 3×3 grid. This helps you understand the raw data before applying any reduction technique.

import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('zip.txt')
X = data[:, 1:]
y = data[:, 0]
digit = 3
indices = np.where(y == digit)[0][:9]
fig, axes = plt.subplots(3, 3)
for i, ax in enumerate(axes.flat):
    ax.imshow(X[indices[i]].reshape(16,16), cmap='gray')
    ax.axis('off')
plt.show()

Part 2: PCA via Optimization

Center the data: X_tilde = X - X.mean(axis=0). Then solve the optimization problem for s = 4 principal components. The solution F_hat, R_hat minimizes the reconstruction error. In practice, you can use sklearn.decomposition.PCA with n_components=4.

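from sklearn.decomposition import PCA
# Center each pixel column; X is the pixel matrix loaded in Part 1
X_tilde = X - X.mean(axis=0)
pca = PCA(n_components=4)
W = pca.fit_transform(X_tilde)  # n x 4 matrix of component scores
reconstruction = pca.inverse_transform(W)
# Total squared reconstruction error, summed over all images and pixels
error = np.sum((X_tilde - reconstruction)**2)
print('Reconstruction error:', error)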

This error quantifies how much information is lost when compressing to 4 dimensions.
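
The optimization view can also be checked directly with a singular value decomposition. The sketch below is a minimal illustration, assuming the forward map F_hat projects onto the top s right singular vectors of the centered data and the reverse map R_hat is its transpose. The helper name pca_svd and the synthetic matrix are placeholders, not part of the assignment.

```python
import numpy as np

def pca_svd(X, s):
    """Rank-s PCA of X via SVD (hypothetical helper, not from the assignment).

    Returns the scores W (n x s), the forward map F (s x p), the reverse
    map R (p x s), and the total squared reconstruction error.
    """
    X_tilde = X - X.mean(axis=0)            # center each column
    U, d, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    F = Vt[:s]                              # forward map: rows are the top-s directions
    R = F.T                                 # reverse map back to the original space
    W = X_tilde @ F.T                       # low-dimensional scores
    error = np.sum((X_tilde - W @ R.T) ** 2)
    return W, F, R, error

# Synthetic stand-in for the zip pixel matrix (assumption: 200 images, 256 pixels)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 256))
W_demo, F_demo, R_demo, err_demo = pca_svd(X_demo, s=4)
print(W_demo.shape, err_demo)
```

The error returned here equals the sum of the squared singular values that were discarded, and it should agree closely with the value printed by the sklearn version when both are run on the same data.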

Part 3: Visualizing Forward Mappings

Plot the 4-dimensional W_i using a 2D projection (e.g., first two components) colored by digit label. You'll likely see clusters: digits like 0 and 1 separate well, while 3 and 8 may overlap. Discuss differences: the forward mapping captures variance that distinguishes digits, but some classes remain entangled due to non-linearities.
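
One way to draw such a projection is sketched below. It is a minimal example, assuming the scores sit in an n x 4 array and the labels in a length-n array; the function name plot_scores and the synthetic inputs are illustrative stand-ins for the real PCA output.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_scores(W, y, dims=(0, 1)):
    """Scatter two columns of the score matrix W, one color per label."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for label in np.unique(y):
        mask = y == label
        ax.scatter(W[mask, dims[0]], W[mask, dims[1]], s=8, alpha=0.6,
                   label=str(int(label)))
    ax.set_xlabel(f"Component {dims[0] + 1}")
    ax.set_ylabel(f"Component {dims[1] + 1}")
    ax.legend(title="Digit", markerscale=2, fontsize="small")
    return ax

# Synthetic stand-in for the scores and digit labels (assumption, not zip data)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 10, size=300).astype(float)
W_demo = rng.normal(size=(300, 4)) + y_demo[:, None]
ax = plot_scores(W_demo, y_demo)
```

With the real data, call plot_scores(W, y) followed by plt.show(). Passing dims=(2, 3) shows the remaining two components, which is worth checking before concluding that two classes overlap.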

Part 4: Variance Explained via Spectral Decomposition

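Compute the Gram matrix G = X_tilde @ X_tilde.T and its eigenvalues, sorted in descending order (np.linalg.eigh returns them in ascending order, so reverse the result). The nonzero eigenvalues of G coincide with those of X_tilde.T @ X_tilde, so the proportion of variance explained by the first s = 4 eigenvectors is sum(eigenvalues[:4]) / sum(eigenvalues). For zip data, this is typically around 30-40%, indicating that 4 linear components capture a modest but useful portion of the signal.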
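
The calculation can be sketched in a few lines of NumPy. This is a minimal example on synthetic data; the helper name variance_explained and the column scales are assumptions chosen only to make the output easy to interpret.

```python
import numpy as np

def variance_explained(X, s):
    """Share of total variance captured by the top-s eigenvalues of the Gram matrix."""
    X_tilde = X - X.mean(axis=0)
    G = X_tilde @ X_tilde.T                         # n x n Gram matrix
    eigenvalues = np.linalg.eigvalsh(G)[::-1]       # eigvalsh is ascending; flip to descending
    eigenvalues = np.clip(eigenvalues, 0.0, None)   # remove tiny negative round-off
    return eigenvalues[:s].sum() / eigenvalues.sum()

# Synthetic data: two high-variance columns followed by eight unit-variance columns
rng = np.random.default_rng(1)
scales = np.array([3.0, 2.0] + [1.0] * 8)
X_demo = rng.normal(size=(150, 10)) * scales
ratio = variance_explained(X_demo, s=2)
print(round(ratio, 3))
```

Because the Gram matrix is n x n, it grows with the number of images. When n is much larger than the 256 pixels, the eigenvalues of X_tilde.T @ X_tilde give the same ratio at lower cost.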

Part 5: Factor Analysis via Maximum Likelihood

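Fit a factor analysis model with s = 4 latent factors. Use sklearn.decomposition.FactorAnalysis with n_components=4. The parameters μ, R, σ² are estimated by maximizing the log-likelihood. The model assumes X_i = μ + R W_i + ε_i, where W_i ~ N(0, I) and ε_i ~ N(0, σ²I). Note that sklearn's FactorAnalysis estimates a separate noise variance for each pixel (a diagonal noise covariance) rather than a single σ², so the scalar σ² used here is the average of those per-pixel variances and only approximates the isotropic model.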

from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=4, random_state=0)
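fa.fit(X)
mu = fa.mean_  # estimated mean vector, one entry per pixel
R = fa.components_.T  # loading matrix with one column per latent factor
# noise_variance_ holds one variance per pixel; average them to get a single sigma^2
sigma2 = fa.noise_variance_.mean()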

Part 6: Posterior Expectations of Latent Variables

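Under the isotropic-noise model stated in Part 5, the posterior mean of W_i given X_i is E[W_i|X_i] = (R^T R + σ² I)^{-1} R^T (X_i - μ). Plot these estimates colored by label. With sklearn, fa.transform(X) returns the corresponding posterior means under its per-pixel noise model. Compare with PCA scores: factor analysis can yield more interpretable latent spaces because it models noise explicitly, which shrinks the estimates toward zero where the data are noisy.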
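
The posterior-mean formula translates directly into NumPy. The sketch below assumes a single noise variance sigma2 and uses simulated parameters; the name posterior_mean and all the demo arrays are hypothetical and stand in for the fitted mu, R, and sigma2.

```python
import numpy as np

def posterior_mean(X, mu, R, sigma2):
    """E[W_i | X_i] for each row of X under X_i = mu + R W_i + eps_i, eps_i ~ N(0, sigma2 I)."""
    s = R.shape[1]
    M = R.T @ R + sigma2 * np.eye(s)        # s x s matrix from the formula
    # Solve M W^T = R^T (X - mu)^T instead of forming the inverse explicitly
    return np.linalg.solve(M, R.T @ (X - mu).T).T

# Simulate from the model so the estimates can be compared with the truth
rng = np.random.default_rng(2)
p, s, n = 20, 4, 500
R_demo = rng.normal(size=(p, s))
mu_demo = rng.normal(size=p)
W_true = rng.normal(size=(n, s))
X_demo = mu_demo + W_true @ R_demo.T + 0.1 * rng.normal(size=(n, p))
W_post = posterior_mean(X_demo, mu_demo, R_demo, sigma2=0.01)
print(W_post.shape)
```

Using np.linalg.solve avoids an explicit matrix inverse, which is the usual choice for numerical stability even though the s x s system here is tiny.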

Part 7: Autoencoder for Nonlinear Dimensionality Reduction

Train a 3-layer autoencoder with a chosen activation function (e.g., ReLU) to compress images to s = 4 dimensions. Use tensorflow/keras to define the encoder W = a(f x + c), where a is the activation function, f the weight matrix, and c the bias vector. Plot the encoded representations. Autoencoders can capture non-linear structure, potentially separating digits better than linear methods.

import tensorflow as tf
from tensorflow.keras import layers, Model
input_dim = 256
encoding_dim = 4
input_img = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation='relu')(input_img)
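# Linear output: a sigmoid would confine reconstructions to (0, 1), which fails if pixels are scaled to [-1, 1]
decoded = layers.Dense(input_dim, activation='linear')(encoded)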
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True, validation_split=0.2)
encoder = Model(input_img, encoded)
W_ae = encoder.predict(X)

Problem 2: Two-Sample Testing on Gene Expression Data

Part 1: Maximum Mean Discrepancy (MMD) Test

Load the Golub data: gene expressions for 72 patients (47 ALL, 25 AML). Use a Gaussian kernel κ(x,y) = exp(-β ||x-y||²) with β = 2^{-28}. Compute the MMD statistic:

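def gaussian_kernel(A, B, beta):
    # Kernel matrix with entries exp(-beta * ||a - b||^2)
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-beta * np.maximum(sq_dists, 0))

def mmd(X, Y, beta):
    # Unbiased estimate of the squared MMD; diagonal terms are excluded
    K_XX = gaussian_kernel(X, X, beta)
    K_YY = gaussian_kernel(Y, Y, beta)
    K_XY = gaussian_kernel(X, Y, beta)
    m = len(X)
    n = len(Y)
    return (K_XX.sum() - np.trace(K_XX))/(m*(m-1)) + (K_YY.sum() - np.trace(K_YY))/(n*(n-1)) - 2*K_XY.sum()/(m*n)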

Compare the observed statistic to the critical value from a permutation test (e.g., 1000 permutations). At α=0.1, you may reject H0, indicating that ALL and AML have different distributions. Discuss power: with only 72 samples, the test may be underpowered for subtle differences, but here the signal is strong.
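
A permutation test for the MMD statistic can be sketched as follows. This is a minimal version under stated assumptions: the kernel matrix of the pooled sample is computed once and re-indexed for each permutation, the helper names are hypothetical, and the data are synthetic with the Golub group sizes rather than the real expression matrix.

```python
import numpy as np

def gaussian_kernel(A, B, beta):
    """Kernel matrix with entries exp(-beta * ||a - b||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-beta * np.maximum(sq, 0.0))

def mmd_from_kernel(K, idx_x, idx_y):
    """Unbiased squared-MMD estimate from a precomputed pooled kernel matrix."""
    m, n = len(idx_x), len(idx_y)
    K_xx = K[np.ix_(idx_x, idx_x)]
    K_yy = K[np.ix_(idx_y, idx_y)]
    K_xy = K[np.ix_(idx_x, idx_y)]
    return ((K_xx.sum() - np.trace(K_xx)) / (m * (m - 1))
            + (K_yy.sum() - np.trace(K_yy)) / (n * (n - 1))
            - 2.0 * K_xy.sum() / (m * n))

def mmd_permutation_test(X, Y, beta, n_perm=1000, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    K = gaussian_kernel(Z, Z, beta)         # computed once, reused for every permutation
    m, N = len(X), len(Z)
    observed = mmd_from_kernel(K, np.arange(m), np.arange(m, N))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(N)           # shuffle group membership under H0
        exceed += mmd_from_kernel(K, perm[:m], perm[m:]) >= observed
    p_value = (1 + exceed) / (1 + n_perm)   # add-one correction keeps p > 0
    return observed, p_value

# Synthetic stand-in with the Golub group sizes (assumption: not the real data)
rng = np.random.default_rng(3)
X_demo = rng.normal(0.0, 1.0, size=(47, 50))
Y_demo = rng.normal(0.7, 1.0, size=(25, 50))
stat, p = mmd_permutation_test(X_demo, Y_demo, beta=1.0 / 100, n_perm=500)
print(stat, p)
```

Rejecting H0 when the p-value is at most α is equivalent to comparing the observed statistic with the (1 - α) quantile of the permuted statistics. The demo uses a β suited to the synthetic scale; for the Golub data, use the β given in the assignment.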

Part 2: Gene-Wise Marginal Tests

For each gene j, compute a two-sample t-test or Wilcoxon test p-value. After multiple testing correction (e.g., Bonferroni or FDR), identify differentially expressed genes. This is a classic approach in bioinformatics: find genes that distinguish cancer subtypes.
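
The gene-wise procedure can be sketched as below. This is a minimal example, assuming Welch t-tests via scipy.stats.ttest_ind and a hand-written Benjamini-Hochberg step-up rule for FDR control; the function name genewise_tests and the synthetic expression matrices are illustrative only.

```python
import numpy as np
from scipy import stats

def genewise_tests(X, Y, alpha=0.1):
    """Welch t-test per column, then Bonferroni and Benjamini-Hochberg decisions."""
    _, pvals = stats.ttest_ind(X, Y, axis=0, equal_var=False)
    p_total = pvals.size
    bonferroni = pvals <= alpha / p_total
    # Benjamini-Hochberg step-up: find the largest k with p_(k) <= k * alpha / p_total
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, p_total + 1) / p_total
    passed = np.nonzero(pvals[order] <= thresholds)[0]
    bh = np.zeros(p_total, dtype=bool)
    if passed.size > 0:
        bh[order[: passed[-1] + 1]] = True  # reject the k smallest p-values
    return pvals, bonferroni, bh

# Synthetic stand-in: 200 genes, the first 10 shifted in the second group (assumption)
rng = np.random.default_rng(4)
n_genes = 200
X_demo = rng.normal(size=(47, n_genes))
Y_demo = rng.normal(size=(25, n_genes))
Y_demo[:, :10] += 2.0
pvals, bonf, bh = genewise_tests(X_demo, Y_demo)
print(bonf.sum(), bh.sum())
```

Every gene rejected by Bonferroni is also rejected by Benjamini-Hochberg at the same α, so the FDR list is always at least as long. That difference matters when thousands of genes are tested at once.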

Conclusion

This tutorial covered key techniques from STAT3006 Assignment 4: PCA, factor analysis, autoencoders, and kernel two-sample testing. These methods are essential for high-dimensional data analysis, whether you're analyzing handwritten digits or gene expression profiles. As AI and data science continue to evolve, understanding these foundational tools will help you tackle modern challenges in fields like precision medicine, autonomous systems, and financial modeling.