ML Fundamentals & California Housing Regression Tutorial 2026

Introduction: Why ML Fundamentals Matter in 2026

In 2026, machine learning is everywhere—from AI-powered gaming NPCs to real-time stock trading apps. Understanding the difference between training error and generalization error is crucial for building models that work in the real world. This tutorial will walk you through key concepts from Homework 1 ML 80629A, using the California housing dataset and text classification examples. Whether you're a student or a self-learner, these fundamentals will help you avoid overfitting and improve model performance.

1. Training Error vs. Generalization Error

Training error is the error your model makes on the data it was trained on. Generalization error is the expected error on new, unseen data. It cannot be measured directly, so in practice we estimate it with a validation set or cross-validation. A common pitfall is data leakage: letting information from the validation or test set influence training, which makes the measured error look better than the error you will see on truly new data.

Example: Imagine training a model to predict student grades. If you include future exam results in your training data, your model will seem perfect but fail on actual new exams.

Evaluating Generalization Error

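Use a held-out test set or k-fold cross-validation. Shuffle the data before splitting (unless it is ordered in time) and ensure there is no overlap between training and validation folds. Beware of overfitting, where a model memorizes noise instead of learning patterns: the symptom is a training error far below the validation error.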
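The gap between training error and cross-validated error can be seen in a small sketch. The data here is synthetic (a noisy sine curve, chosen only for illustration), and an unpruned decision tree stands in for "a model that memorizes noise"; neither comes from the homework itself.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

# An unpruned tree can memorize every training point, noise included.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
train_mse = np.mean((model.predict(X) - y) ** 2)

# 5-fold CV: each fold is held out once; shuffling guards against ordered data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()

# Check that folds never overlap and every sample is validated exactly once.
val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in cv.split(X):
    assert len(np.intersect1d(train_idx, val_idx)) == 0
    val_counts[val_idx] += 1

print(f"training MSE: {train_mse:.3f}, cross-validated MSE: {cv_mse:.3f}")
```

The training MSE is essentially zero because the tree reproduces its own training labels, while the cross-validated MSE reflects the noise it failed to generalize past.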

2. The Danger of Using Test Set Labels for Training

Suppose you train a model, use it to label an unlabeled test set, and then retrain on both the original training data and the newly labeled test data. Would validation error decrease? Most likely not, because the new "labels" are not ground truth: they are the model's own predictions. Retraining on them mostly reinforces what the model already believes, including its mistakes, and adds no new label information. The validation error may even increase as the model overfits to noisy pseudo-labels.
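For ordinary least squares the "no new information" point can be checked exactly: pseudo-labeled points lie on the fitted hyperplane, so they contribute zero residual and the refit should coincide with the original fit up to floating-point error. The sketch below assumes synthetic data and LinearRegression purely for illustration; other model families may drift when retrained on pseudo-labels rather than stay fixed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(100, 3))
y_train = X_train @ true_w + rng.normal(scale=0.3, size=100)
X_unlabeled = rng.normal(size=(50, 3))  # plays the role of the unlabeled test set

# Step 1: fit on labeled data, then pseudo-label the unlabeled points.
base = LinearRegression().fit(X_train, y_train)
pseudo_labels = base.predict(X_unlabeled)

# Step 2: retrain on the original data plus the pseudo-labeled data.
X_aug = np.vstack([X_train, X_unlabeled])
y_aug = np.concatenate([y_train, pseudo_labels])
retrained = LinearRegression().fit(X_aug, y_aug)

# The pseudo-labels carry no information the first fit did not already have.
same_model = (
    np.allclose(base.coef_, retrained.coef_)
    and np.isclose(base.intercept_, retrained.intercept_)
)
print("retrained model identical to original:", same_model)
```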

3. Regularizing K-NN: The Bias-Variance Tradeoff

K-NN has no explicit regularization term, but you can regularize it by increasing K (the number of neighbors). Larger K reduces variance but increases bias. This is analogous to smoothing: a larger neighborhood averages over more points, making the decision boundary simpler. At one extreme, K=1 reproduces the training labels exactly; at the other, K equal to the training set size (with uniform weights) predicts the majority class everywhere. Think of it like a game's difficulty setting: higher K is like playing with training wheels (less variance, more bias).
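A quick way to see this tradeoff is to sweep K and compare training and validation accuracy. The sketch below uses a synthetic dataset with some label noise and an arbitrary list of K values; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem with ~10% label noise (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for k in [1, 5, 25, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each K.
    results[k] = (knn.score(X_tr, y_tr), knn.score(X_val, y_val))
    print(f"K={k:>3}  train acc={results[k][0]:.3f}  val acc={results[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect regardless of label noise; as K grows, training accuracy falls because the model can no longer fit individual noisy points.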

4. K-NN for Document Classification with Bag-of-Words

Yes, you can use K-NN on bag-of-words features. A sensible distance function is cosine distance (1 - cosine similarity). Cosine distance ignores document length and focuses on the angle between vectors, which works well for sparse text data. Alternatively, Euclidean distance can be used but may be dominated by document length.
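The effect of document length on the two distances can be sketched with a toy corpus. The documents and labels below are invented for illustration; the long query is just the short query repeated three times, so it has the same word proportions but larger counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.neighbors import KNeighborsClassifier

# Toy corpus (hypothetical): 1 = positive, 0 = negative.
docs = [
    "great movie great acting",
    "wonderful acting wonderful fun",
    "awful plot awful acting",
    "boring plot boring movie",
]
labels = [1, 1, 0, 0]

short_doc = "great fun"
long_doc = "great fun great fun great fun"  # same content, three times longer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
queries = vectorizer.transform([short_doc, long_doc])

# Cosine distance depends only on direction; Euclidean also depends on length.
cos = cosine_distances(queries[0], queries[1])[0, 0]
euc = euclidean_distances(queries[0], queries[1])[0, 0]
print(f"cosine distance: {cos:.3f}, Euclidean distance: {euc:.3f}")

# K-NN with cosine distance; brute-force search supports sparse input.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine", algorithm="brute")
knn.fit(X, labels)
pred = knn.predict(queries)
print("predictions for short and long query:", list(pred))
```

Under cosine distance the two queries are indistinguishable, so they necessarily get the same prediction; under Euclidean distance they sit far apart even though they say the same thing.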

5. Pros and Cons of Larger K in K-Fold Cross-Validation

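Advantages: each model is trained on a larger share of the data (a fraction (K-1)/K of it), so the error estimate is less pessimistically biased. Disadvantages: higher computational cost (K training runs), and when K is very large (e.g., K=N, leave-one-out) the training sets are nearly identical, so the per-fold estimates are highly correlated and the averaged estimate can have high variance. In practice, K=5 or 10 is common.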
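The cost side of this tradeoff is easy to tabulate: the number of model fits equals K, and each training set contains all samples outside one fold. The helper name fold_summary and the sample count of 20 are hypothetical, chosen only to keep the output small.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X = np.arange(n_samples).reshape(-1, 1)  # placeholder features

def fold_summary(k):
    """Return (number of model fits, list of training-set sizes) for k-fold CV."""
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    train_sizes = [len(train_idx) for train_idx, _ in cv.split(X)]
    return len(train_sizes), train_sizes

for k in (2, 5, 10, n_samples):
    n_fits, sizes = fold_summary(k)
    print(f"K={k:>2}: {n_fits} fits, training size per fit = {sizes[0]}")
```

At K equal to the number of samples (leave-one-out), every training set is missing only a single point, which is why the fitted models, and therefore their errors, are so similar to each other.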

6. Exploring the California Housing Dataset

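Load the dataset with sklearn.datasets.fetch_california_housing(as_frame=True), which returns the data as a pandas DataFrame, and call .describe() on it to see summary statistics. You'll notice that attributes like median income (MedInc), house age (HouseAge), and average rooms (AveRooms) have very different scales. Histograms may show skewed distributions (e.g., median income is right-skewed). Both observations matter for preprocessing, especially for scale-sensitive models such as K-NN, Lasso, and Ridge.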
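A minimal exploration sketch follows. The helper names load_housing and summarize are hypothetical, and the demo DataFrame is synthetic stand-in data so the block runs without a network connection; only fetch_california_housing(as_frame=True), .describe(), and .skew() are actual library calls.

```python
import numpy as np
import pandas as pd

def load_housing():
    """Download the dataset as a DataFrame (needs network access on first call)."""
    from sklearn.datasets import fetch_california_housing
    return fetch_california_housing(as_frame=True).frame

def summarize(df):
    """Return per-column summary statistics plus sample skewness."""
    summary = df.describe().T        # one row per column: count, mean, std, ...
    summary["skew"] = df.skew()      # > 0 means a longer right tail
    return summary

# Stand-in data (assumption): one right-skewed and one roughly symmetric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income_like": rng.lognormal(mean=1.0, sigma=0.5, size=1000),
    "age_like": rng.uniform(1, 52, size=1000),
})
demo_summary = summarize(demo)
print(demo_summary)
```

On the real data, summarize(load_housing()) gives the same table for the housing attributes, and load_housing().hist(bins=50) draws the histograms if matplotlib is installed.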

Linear Regression with 10-Fold CV

Perform 10-fold cross-validation (shuffle=True, random_state=20160202) with LinearRegression. Report the mean MSE over the validation folds. Typical MSE values are around 0.5-0.7, in units of ($100,000)^2 because the target is the median house value expressed in units of $100,000.
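One way to set this up is sketched below. The helper name mean_cv_mse is hypothetical, and the demo call uses synthetic data from make_regression so it runs offline; the fold settings follow the text above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def mean_cv_mse(model, X, y, seed=20160202):
    """Mean validation MSE over 10 shuffled folds."""
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    # scikit-learn maximizes scores, so MSE is reported negated.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()

# Synthetic stand-in (assumption): 8 features, Gaussian noise with std 10.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0
)
demo_mse = mean_cv_mse(LinearRegression(), X_demo, y_demo)
print(f"mean 10-fold validation MSE: {demo_mse:.2f}")
```

For the housing data, pass the data and target attributes of the object returned by fetch_california_housing() as X and y. The same helper accepts Lasso or Ridge in place of LinearRegression, which keeps the comparison on identical folds.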

7. Lasso and Ridge: What Do They Do?

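One word: Regularization. Lasso (L1) and Ridge (L2) add a penalty on the size of the coefficients to the loss function to prevent overfitting: Lasso penalizes the sum of absolute values of the coefficients, Ridge the sum of their squares. Lasso can shrink some coefficients to exactly zero (feature selection), while Ridge shrinks them toward zero but keeps all features.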

Comparing Coefficients

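After 10-fold CV, compare the coefficients (coef_) of LinearRegression, Lasso, and Ridge. You'll observe that Lasso sets some coefficients to exactly zero, Ridge shrinks all coefficients without zeroing them, and LinearRegression may have large coefficients when features are collinear. On the same training data, the regularized models always have training MSE at least as high as plain linear regression, since the penalty pulls them away from the least-squares fit. Their validation MSE is lower only when the unregularized model overfits; with 8 features and roughly 20,000 rows, that effect is usually small here, and a penalty that is too strong can make validation MSE worse.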
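The coefficient patterns can be reproduced on a small synthetic problem where only the first two of six features matter. The data, the alpha values, and the feature count are all assumptions chosen for illustration, not the homework's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (assumption): y depends on features 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),   # L2 penalty strength (illustrative value)
    "lasso": Lasso(alpha=0.5),    # L1 penalty strength (illustrative value)
}

# Fit each model and collect its coefficient vector.
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, c in coefs.items():
    print(f"{name:>6}: {np.round(c, 3)}")

n_zero_lasso = int(np.sum(coefs["lasso"] == 0))
print("coefficients set exactly to zero by Lasso:", n_zero_lasso)
```

Lasso drops the four irrelevant features entirely and shrinks the two real ones, while Ridge keeps small nonzero weights on everything.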

8. Text Classification with Naive Bayes and Neural Networks

For sentiment analysis on movie reviews, you'll implement Bag-of-Words (BoW) and TF-IDF features. Use sklearn's CountVectorizer and TfidfVectorizer with max_features=10000.

Naive Bayes

Use MultinomialNB for both BoW and TF-IDF features. Report validation accuracy (typically 80-85%). Extract the top 5 positive and negative words based on the per-class word log probabilities, which MultinomialNB exposes as feature_log_prob_.

Neural Networks

Use MLPClassifier with early stopping (early_stopping=True). Tune these hyperparameters: hidden layer sizes [4,8,16] for the first layer and [0,4,8] for the second (0 meaning no second layer), learning rate (learning_rate_init) [0.1,0.01,0.001], and L2 penalty (alpha) [0.001,0.01,0.1]. Set random_state=12345. The best combination often involves a single hidden layer of 16 neurons, learning rate 0.01, and L2=0.001, achieving ~85% validation accuracy.
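The search can be organized as a plain loop over all combinations. In the sketch below, build_grid and validation_accuracy are hypothetical helper names, a second-layer size of 0 is interpreted as "no second layer", and the data is synthetic; only the first three configurations are scored to keep the demo fast.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def build_grid():
    """Enumerate every hyperparameter combination as MLPClassifier kwargs."""
    grid = []
    for h1, h2, lr, l2 in product(
        [4, 8, 16], [0, 4, 8], [0.1, 0.01, 0.001], [0.001, 0.01, 0.1]
    ):
        hidden = (h1,) if h2 == 0 else (h1, h2)  # assumption: 0 = no second layer
        grid.append(
            {"hidden_layer_sizes": hidden, "learning_rate_init": lr, "alpha": l2}
        )
    return grid

def validation_accuracy(params, X_tr, y_tr, X_val, y_val):
    """Fit one configuration with early stopping and score it on validation data."""
    clf = MLPClassifier(early_stopping=True, random_state=12345, **params)
    clf.fit(X_tr, y_tr)
    return clf.score(X_val, y_val)

grid = build_grid()
print("number of configurations:", len(grid))

# Synthetic stand-in for the vectorized reviews (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Score a small subset here; loop over all of `grid` for the full search.
scores = [(validation_accuracy(p, X_tr, y_tr, X_val, y_val), p) for p in grid[:3]]
best_acc, best_params = max(scores, key=lambda item: item[0])
print(f"best of subset: {best_acc:.3f} with {best_params}")
```

Note that early_stopping=True makes MLPClassifier set aside part of the training data internally to decide when to stop, which is separate from the validation set used here to compare configurations.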

9. Comparison and Best Practices

The majority-class baseline (always predicting the most frequent label) has 50% accuracy, since the classes are balanced. For BoW, the neural network usually outperforms Naive Bayes (e.g., 85% vs 83%). For TF-IDF, Naive Bayes may catch up. The best feature representation depends on the model: neural networks benefit from TF-IDF's scaling, while Naive Bayes works well with raw counts. Overall, neural networks with TF-IDF tend to be the best-performing combination, which is usually attributed to their ability to learn more complex patterns.
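The baseline figure can be checked with scikit-learn's DummyClassifier. The labels below are synthetic and perfectly balanced by construction; the features are placeholders because the baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Balanced synthetic labels (assumption): equal numbers of 0s and 1s.
y_train = np.array([0, 1] * 50)
y_val = np.array([0, 1] * 20)
X_train = np.zeros((len(y_train), 1))  # features are ignored by the baseline
X_val = np.zeros((len(y_val), 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_val, y_val)
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

Any model worth reporting should clear this number by a comfortable margin on the same validation split.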

Conclusion

Mastering these ML fundamentals will serve you well in 2026's data-driven world. Remember to always separate training and test data, use cross-validation, and regularize appropriately. Whether you're building a model for a gaming AI or a financial app, these principles remain the foundation.