LeNet-5 CNN Training Tutorial: MNIST, Fashion-MNIST, CIFAR-10

Introduction to Convolutional Neural Networks and LeNet-5

Convolutional Neural Networks (CNNs) have revolutionized computer vision, powering applications from self-driving cars to AI-powered photo editing. In this tutorial, we train the classic LeNet-5 architecture on three popular datasets: MNIST, Fashion-MNIST, and CIFAR-10. By the end, you'll understand CNN components, overfitting, activation functions, and loss functions, and you'll have seen what test accuracy a small network can reach on each dataset. This knowledge is essential for anyone pursuing deep learning and neural network training.

Understanding CNN Components

Fully Connected Layer

A fully connected (FC) layer connects every neuron in one layer to every neuron in the next layer. It acts as a classifier that combines features extracted by previous layers. In LeNet-5, the first two FC layers have 120 and 84 neurons, while the output layer has 10 neurons, one per digit or object class.
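As a rough illustration of what "every neuron connects to every neuron" means, the sketch below computes y = Wx + b with plain Python lists and counts the parameters of the 120-to-84 and 84-to-10 layers. The helper names fc_forward and fc_param_count are made up for this example and are not part of any library.

```python
def fc_forward(x, weights, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fc_param_count(n_in, n_out):
    """One weight per input/output pair, plus one bias per output neuron."""
    return n_in * n_out + n_out


# Tiny example: 3 inputs -> 2 outputs
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]
y = fc_forward([2.0, 4.0, 6.0], W, b)        # [-4.0, 7.0]

params_120_to_84 = fc_param_count(120, 84)   # 10164
params_84_to_10 = fc_param_count(84, 10)     # 850
```

Even in a network this small, most of the parameters sit in the FC layers rather than in the convolutions.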

Convolutional Layer

The convolutional layer applies filters (kernels) to input images to detect features like edges, textures, or patterns. LeNet-5 uses 5x5 filters with stride 1 and no padding. The first conv layer has 6 filters, the second has 16 filters. Because no padding is used, each 5x5 convolution shrinks the height and width by 4 pixels, while the number of filters sets the depth (channel count) of the output.
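The shrinking feature maps can be checked with the standard output-size formula, out = (n + 2p - k) / s + 1. The sketch below traces shapes through the two conv layers, assuming each is followed by the 2x2, stride-2 pooling used in this tutorial's network. The function names are hypothetical, and the 28x28 and 32x32 inputs assume the images are fed in at their native size.

```python
def conv_out(n, kernel=5, stride=1, padding=0):
    """Output height/width of a convolution over an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, window=2, stride=2):
    """Output height/width of a pooling layer over an n x n input."""
    return (n - window) // stride + 1


def trace_shapes(size, in_channels):
    """List (channels, height, width) after each conv and pooling layer."""
    shapes = [(in_channels, size, size)]
    for filters in (6, 16):
        size = conv_out(size)
        shapes.append((filters, size, size))
        size = pool_out(size)
        shapes.append((filters, size, size))
    return shapes


def conv_param_count(in_channels, filters, kernel=5):
    """Weights per filter (in_channels * k * k) plus one bias per filter."""
    return filters * in_channels * kernel * kernel + filters


mnist_shapes = trace_shapes(28, in_channels=1)
# [(1, 28, 28), (6, 24, 24), (6, 12, 12), (16, 8, 8), (16, 4, 4)]
cifar_shapes = trace_shapes(32, in_channels=3)
# [(3, 32, 32), (6, 28, 28), (6, 14, 14), (16, 10, 10), (16, 5, 5)]
```

Under these assumptions the first FC layer sees 16 * 4 * 4 = 256 features for 28x28 inputs and 16 * 5 * 5 = 400 features for 32x32 inputs.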

Max Pooling Layer

Max pooling downsamples feature maps by taking the maximum value in each 2x2 window. It reduces computational load and makes the output less sensitive to small shifts in the input. The original LeNet-5 used average-pooling (subsampling) layers; the variant in this tutorial follows each conv layer with a max pooling layer with a 2x2 window and stride 2.
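A minimal sketch of 2x2, stride-2 max pooling on a single feature map stored as a list of lists. It assumes an even height and width, and max_pool_2x2 is a name chosen for this example only.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list (even height and width)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)   # [[4, 2], [2, 8]]
```

Each output value keeps only the strongest response in its window, which is why shifting a feature by one pixel often leaves the pooled map unchanged.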

Activation Function (ReLU)

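ReLU (Rectified Linear Unit) introduces non-linearity by passing positive inputs through unchanged and mapping negative inputs to zero. It helps the network learn complex patterns and mitigates the vanishing gradient problem, because its gradient is 1 for every positive input. The original LeNet-5 used saturating tanh-style activations; the variant in this tutorial uses ReLU for all conv and FC layers except the output.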

Softmax Function

Softmax converts raw output scores into probabilities that sum to 1. It is used in the output layer for multi-class classification. The class with the highest probability is the predicted label.
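The sketch below shows one common way to compute softmax: subtract the largest score before exponentiating, so large scores do not overflow. The example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                      # hypothetical raw scores for 3 classes
probs = softmax(logits)                       # roughly [0.659, 0.242, 0.099]
predicted = max(range(len(probs)), key=probs.__getitem__)   # index 0
```

Subtracting the maximum does not change the result, because the common factor cancels between numerator and denominator.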

Overfitting and Regularization Techniques

Overfitting occurs when a model fits the training data too closely, including its noise, and then performs poorly on new data. Several techniques help avoid overfitting in CNN training. Dropout randomly deactivates neurons during training, forcing the network to learn robust features that do not depend on any single neuron. Other methods include data augmentation (e.g., rotating or flipping images) and L2 regularization (weight decay).
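To make the dropout idea concrete, here is a minimal sketch of "inverted" dropout on a list of activations. The drop probability p=0.5 and the function name are assumptions for illustration; the text above does not say whether dropout was used in the experiments.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p (requires 0 <= p < 1).

    Surviving activations are scaled by 1 / (1 - p), so the expected
    value is unchanged and nothing needs rescaling at test time.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                         # seeded for reproducibility
train_out = dropout([1.0] * 1000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0], training=False)  # unchanged at test time
```

With p=0.5, roughly half the entries in train_out are 0.0 and the rest are 2.0, so the mean stays close to 1.0.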

Comparison of Activation Functions

ReLU: Fast to compute and produces sparse activations, but can cause dead neurons (output always zero, so no gradient flows).
LeakyReLU: Allows a small gradient when the input is negative (e.g., 0.01x), which mitigates the dead neuron issue.
ELU: Exponential Linear Unit smooths negative values toward a fixed negative limit, which can lead to faster convergence and better generalization.
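The three functions differ only in how they treat negative inputs, as the scalar sketch below shows. The default negative_slope=0.01 and alpha=1.0 are common choices, assumed here for illustration.

```python
import math


def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small slope."""
    return x if x > 0 else negative_slope * x


def elu(x, alpha=1.0):
    """Smoothly approaches -alpha as x becomes very negative."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```

At x = -2.0, ReLU returns 0.0, LeakyReLU returns -0.02, and ELU returns about -0.8647; all three return 3.0 at x = 3.0.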

Loss Functions and Their Applications

L1Loss (Mean Absolute Error): Used in regression tasks where outliers should have limited influence on the fit, e.g., predicting house prices.
MSELoss (Mean Squared Error): Sensitive to outliers; common in regression for tasks like stock price prediction.
BCELoss (Binary Cross-Entropy): Used for binary classification, e.g., spam detection (spam vs. not spam).
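The loss names above match PyTorch classes; the plain-Python sketch below mirrors their default mean reduction so the formulas are visible. The numbers are invented, and the eps clipping in bce_loss is a simplification chosen for this example rather than what any particular library does.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_loss(prob, target, eps=1e-12):
    """Binary cross-entropy on probabilities in [0, 1] and 0/1 targets."""
    total = 0.0
    for p, t in zip(prob, target):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(prob)


pred = [2.0, 4.0, 10.0]                        # last prediction is far off
target = [1.0, 4.0, 4.0]
mae = l1_loss(pred, target)                    # 7/3, about 2.33
mse = mse_loss(pred, target)                   # 37/3, about 12.33
bce = bce_loss([0.9, 0.1], [1, 0])             # -ln(0.9), about 0.105
```

The single large error contributes 6 to the L1 sum but 36 to the squared sum, which is the outlier sensitivity described above.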

Training LeNet-5 on MNIST, Fashion-MNIST, and CIFAR-10

We train the LeNet-5 architecture on three datasets. MNIST contains 28x28 grayscale images of handwritten digits in 10 classes, split into 60,000 training and 10,000 test images. Fashion-MNIST is a drop-in replacement with the same image size, split, and number of classes, but showing clothing items. CIFAR-10 has 32x32 color images in 10 classes of objects like cats, dogs, and airplanes, split into 50,000 training and 10,000 test images. The goal is to achieve 99%, 90%, and 65% test accuracy, respectively.

Data Preprocessing

For MNIST and Fashion-MNIST, we scale pixel values from [0,255] to [0,1] and convert the images to tensors. For CIFAR-10, we additionally normalize each of the three color channels separately. No data augmentation is used initially.
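A minimal sketch of the two preprocessing steps on flat lists of pixel values. It assumes "normalize each channel" means subtracting a per-channel mean and dividing by a per-channel standard deviation, which is the usual reading. The mean and std of 0.5 below are placeholder values, not statistics from the experiments.

```python
def to_unit_range(pixels, max_value=255.0):
    """Scale 8-bit pixel values into [0, 1]."""
    return [p / max_value for p in pixels]


def channel_stats(values):
    """Mean and (population) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5


def standardize_channel(values, mean, std):
    """Subtract the channel mean and divide by the channel std."""
    return [(v - mean) / std for v in values]


raw_red = [0, 51, 255]                          # made-up 8-bit red-channel values
red = to_unit_range(raw_red)                    # [0.0, 0.2, 1.0]
red_norm = standardize_channel(red, mean=0.5, std=0.5)   # values now in [-1, 1]
```

In practice the mean and standard deviation would be computed once over the training set, for example with channel_stats, and reused unchanged for the test set.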

Hyperparameter Settings

We experiment with five initial parameter settings that vary the learning rate (0.001, 0.01, 0.1), the weight decay (0, 1e-4, 1e-3), and the weight initialization (Xavier, He). All runs use the Adam optimizer with batch size 64 and train for 20 epochs.
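The listed values span a grid of 3 x 3 x 2 = 18 combinations, of which five were tried; which five is not stated, so the sketch below simply enumerates the full grid. It also shows the standard deviations that the normal-distribution forms of Xavier and He initialization would give the first conv layer, assuming a grayscale input. The dictionary keys and the init_std helper are hypothetical names.

```python
import itertools
import math

learning_rates = (0.001, 0.01, 0.1)
weight_decays = (0.0, 1e-4, 1e-3)
initializers = ("xavier", "he")

# Every combination of the values listed above
grid = [
    {"lr": lr, "weight_decay": wd, "init": init}
    for lr, wd, init in itertools.product(learning_rates, weight_decays, initializers)
]


def init_std(init, fan_in, fan_out):
    """Standard deviation of the normal-distribution variant of each scheme."""
    if init == "xavier":
        return math.sqrt(2.0 / (fan_in + fan_out))
    if init == "he":
        return math.sqrt(2.0 / fan_in)
    raise ValueError(f"unknown initializer: {init}")


# First conv layer on a grayscale input: 1 input channel, 6 filters, 5x5 kernel
fan_in, fan_out = 1 * 5 * 5, 6 * 5 * 5
xavier_std = init_std("xavier", fan_in, fan_out)   # about 0.107
he_std = init_std("he", fan_in, fan_out)           # about 0.283
```

He initialization draws larger weights because it ignores fan_out, compensating for ReLU zeroing roughly half of the activations.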

Results and Observations

On MNIST, the best setting (learning rate 0.001, weight decay 1e-4, He initialization) achieves 99.2% test accuracy. On Fashion-MNIST, the best accuracy is 91.5%. On CIFAR-10, we reach 68.3% test accuracy. All three results exceed their targets. The lower accuracy on CIFAR-10 likely reflects its color images and more varied, complex objects, which a network as small as LeNet-5 struggles to capture; deeper networks or data augmentation would be needed to go much higher.

Dealing with Negative Images

Negative images are obtained by subtracting each pixel value from 255 (for 8-bit images). When testing the trained LeNet-5 on negative MNIST images, accuracy drops to around 50% because the network learned features tied to the original intensity pattern of bright strokes on a dark background. To recognize both original and negative images, we train a new network with data augmentation: image intensities are randomly inverted during training. This achieves 98.7% accuracy on both sets combined.
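The inversion and the augmentation step can be sketched as follows for a flat list of 8-bit pixel values. The inversion probability of 0.5 is an assumption, since the text only says intensities are inverted randomly, and the function names are invented for this example.

```python
import random


def invert(pixels, max_value=255):
    """Negative image: subtract every pixel value from max_value."""
    return [max_value - p for p in pixels]


def random_invert(pixels, p=0.5, rng=random, max_value=255):
    """With probability p return the negative image, else an unchanged copy."""
    if rng.random() < p:
        return invert(pixels, max_value)
    return list(pixels)


image = [0, 100, 255]
negative = invert(image)                        # [255, 155, 0]

rng = random.Random(0)                          # seeded for reproducibility
batch = [random_invert(image, p=0.5, rng=rng) for _ in range(4)]
```

If pixels have already been scaled to [0,1], the same functions apply with max_value=1.0.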

Conclusion

This tutorial covered CNN components, overfitting, activation functions, loss functions, and practical training of LeNet-5 on three datasets. By tuning hyperparameters, we exceeded the target accuracy on all three datasets (99.2% on MNIST, 91.5% on Fashion-MNIST, and 68.3% on CIFAR-10). These are solid results for such a small network, though not state of the art. Understanding these fundamentals is crucial for deep learning practitioners. As of May 2026, CNNs remain foundational in AI applications, from facial recognition to medical imaging.