Programming lesson
Building a Fixed-Depth Decision Tree from Scratch: A Step-by-Step ID3 Implementation Guide
Learn to implement a fixed-depth decision tree using the ID3 algorithm from scratch. This tutorial covers entropy, information gain, recursion, and depth limiting, with examples from the MONK's problems dataset. Perfect for CSCI 184 students and machine learning beginners.
Introduction: Why Build a Decision Tree from Scratch?
Decision trees are one of the most intuitive and widely used machine learning algorithms. They power everything from credit scoring in fintech to recommendation engines in streaming apps like Netflix. In this tutorial, you'll learn to implement a fixed-depth decision tree using the ID3 algorithm from scratch—no scikit-learn allowed! We'll use the classic MONK's problems dataset, a benchmark from the 1990s that remains relevant for understanding decision tree learning.
By the end, you'll be able to generate learning curves, visualize trees, and compare your implementation with scikit-learn's DecisionTreeClassifier. This guide is tailored for CSCI 184 homework 1 part 2 but also serves as a solid foundation for anyone diving into machine learning from scratch.
Understanding the ID3 Algorithm
The ID3 (Iterative Dichotomiser 3) algorithm builds a decision tree by selecting the attribute that best splits the data at each node. The best split is measured by information gain, which is based on entropy. Entropy quantifies the impurity or uncertainty in the data. For a binary classification problem (like MONK's problems), entropy is calculated as:
Entropy(S) = -p+ log2(p+) - p- log2(p-)where p+ and p- are the proportions of positive and negative examples in the set S. Information gain is the reduction in entropy after splitting on an attribute:
Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|Sv|/|S|) * Entropy(Sv)At each node, we compute the gain for all attributes and choose the one with the highest gain. Then we recursively split the data on that attribute's values until we reach a stopping criterion—in our case, a maximum depth.
Implementing the Fixed-Depth Decision Tree
We'll implement a Python class DecisionTreeID3 with methods for fitting and predicting. The key is to limit the tree depth by passing a max_depth parameter. Here's the skeleton:
class DecisionTreeID3:
def __init__(self, max_depth=None):
self.max_depth = max_depth
self.tree = None
def fit(self, X, y, depth=0):
# Base cases: pure node, no attributes, or max depth reached
if len(set(y)) == 1 or X.shape[1] == 0 or (self.max_depth is not None and depth == self.max_depth):
return {'class': y.mode()[0]}
# Find best attribute
best_attr = self._best_attribute(X, y)
tree = {'attr': best_attr, 'children': {}}
for value in X[best_attr].unique():
subset_X = X[X[best_attr] == value].drop(columns=[best_attr])
subset_y = y[X[best_attr] == value]
if subset_X.empty:
tree['children'][value] = {'class': y.mode()[0]}
else:
tree['children'][value] = self.fit(subset_X, subset_y, depth+1)
return treeThis recursive function builds a tree dictionary. Note that we stop early if the node is pure, no attributes remain, or we've reached max_depth. The _best_attribute method calculates information gain for each attribute and returns the best one.
Calculating Entropy and Information Gain
We need helper functions for entropy and gain:
def entropy(y):
proportions = y.value_counts(normalize=True)
return -sum(p * np.log2(p) for p in proportions if p > 0)
def information_gain(X, y, attr):
total_entropy = entropy(y)
values = X[attr].unique()
weighted_entropy = 0
for v in values:
subset_y = y[X[attr] == v]
weighted_entropy += (len(subset_y) / len(y)) * entropy(subset_y)
return total_entropy - weighted_entropyThese functions are the heart of the ID3 algorithm. Make sure to handle edge cases like empty subsets.
Handling the MONK's Problems Dataset
The MONK's problems have six categorical attributes (columns 2–7) and a binary class label (column 1). The dataset is clean and perfect for testing decision trees. Load the data with pandas:
import pandas as pd
df_train = pd.read_csv('monks-1.train', header=None, sep=' ')
df_test = pd.read_csv('monks-1.test', header=None, sep=' ')
X_train = df_train.iloc[:, 1:7] # attributes
y_train = df_train.iloc[:, 0] # labels
X_test = df_test.iloc[:, 1:7]
y_test = df_test.iloc[:, 0]Remember to convert to appropriate types if needed. The attributes are integers from 1 to 4, but they are nominal, so treat them as categories.
Part a: Learning Curves for Depths 1 to 10
For each MONK's problem (monks-1, monks-2, monks-3), you need to compute average training and test errors for depths 1 through 10. Since the dataset is small, you can run multiple trials with different random seeds? Actually, the assignment likely expects a single train/test split provided. But to get average errors, you might need to do cross-validation or multiple runs. The prompt says "average training and test errors"—probably over multiple runs with different train/test splits? Check your assignment details. For simplicity, we'll use the provided train/test files as is, but you can also implement k-fold cross-validation.
Here's the loop:
depths = range(1, 11)
train_errors = []
test_errors = []
for d in depths:
tree = DecisionTreeID3(max_depth=d)
tree.fit(X_train, y_train)
y_pred_train = tree.predict(X_train)
y_pred_test = tree.predict(X_test)
train_errors.append(1 - accuracy_score(y_train, y_pred_train))
test_errors.append(1 - accuracy_score(y_test, y_pred_test))Plot the curves using matplotlib. Expect that training error decreases with depth, while test error may increase due to overfitting.
Part b: Visualizing Trees and Confusion Matrices for Depths 1, 3, 5
For monks-1 only, you need to visualize the learned decision tree and show confusion matrices. The provided skeleton has a render_dot_file() function that converts your tree dictionary to a DOT file, then to PNG. Alternatively, you can manually print the tree structure. For confusion matrix, use scikit-learn's confusion_matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_test)
print(cm)Interpret the matrix: true positives, false positives, etc. For depth=1, the tree is a stump; for depth=3, it's more complex; for depth=5, it may overfit.
Part c: Using scikit-learn's DecisionTreeClassifier
Now repeat part b but using scikit-learn's implementation with criterion='entropy':
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
# Visualize tree using sklearn.tree.plot_tree or export_graphvizCompare the trees and confusion matrices with your custom implementation. They should be similar but may differ due to tie-breaking or attribute ordering.
Common Pitfalls and Tips
- Depth counting: Ensure you start counting depth from 0 (root) or 1 consistently. The assignment says depth=1 means a tree with one split? Typically depth is the number of splits from root to leaf. Check your interpretation.
- Missing attribute values: MONK's problems have no missing values, so you can ignore that.
- Numerical attributes: All attributes are categorical, so no need for discretization.
- Overfitting: At depth 10, the tree will likely memorize the training data, leading to poor test performance.
Connecting to Real-World Trends
Decision trees are not just academic; they're used in AI chatbots like ChatGPT for intent classification, in finance for loan approval, and in gaming for NPC behavior trees. The MONK's problems themselves are a nod to the 1990s AI boom, but the principles still apply in modern machine learning pipelines. As you implement this, think of it as building a simple recommendation system for a streaming app: "If user likes action movies and watched 'Avengers', recommend 'Iron Man'"—that's a decision tree!
Conclusion
You've now built a fixed-depth decision tree from scratch and compared it with scikit-learn. This exercise solidifies your understanding of information theory and recursive partitioning. For your CSCI 184 assignment, make sure to generate the required plots and visualizations. Good luck, and remember: the best way to learn is to code it yourself!