Build a Simple Cancer Classifier: Machine Learning with Python Lists and Dictionaries

Introduction: Machine Learning Meets Healthcare

Machine learning is transforming how we approach medical diagnosis. In this tutorial, you'll build a simple rule-based classifier that predicts whether a tumor is malignant or benign based on ten numerical attributes. This project mirrors a classic CSCI 141 assignment but is updated with modern Python practices. By the end, you'll understand core ML concepts—training, testing, and voting classifiers—while practicing with lists and dictionaries.

Understanding the Problem

We have a dataset of tumor measurements. Each tumor is described by ten attributes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The goal is to predict the class: malignant (M) or benign (B). The key insight is that malignant tumors tend to have larger values for these attributes. For each attribute, our classifier takes the midpoint between the average malignant value and the average benign value and uses it as a threshold when voting on the class of a new tumor.

Machine Learning Framework

Machine learning uses data to make predictions. Here, we split our labeled data into a training set (80%) and a test set (20%). The training set is used to compute averages and midpoints—the classifier. The test set evaluates accuracy on unseen data.

Training Phase

During training, we calculate the average value for each attribute among malignant tumors and among benign tumors. Then we compute the midpoint for each attribute: midpoint = (avg_malignant + avg_benign) / 2. These ten midpoints form our classifier.
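To make the arithmetic concrete, here is a small sketch with made-up averages for two attributes. The numbers are illustrative assumptions, not values computed from the real dataset.

```python
# Hypothetical class averages for two attributes (illustrative values only).
avg_malignant = {'radius': 17.0, 'texture': 22.0}
avg_benign = {'radius': 12.0, 'texture': 18.0}

# The midpoint sits halfway between the two class averages.
midpoints = {
    attr: (avg_malignant[attr] + avg_benign[attr]) / 2
    for attr in avg_malignant
}

print(midpoints)  # {'radius': 14.5, 'texture': 20.0}
```

With these numbers, a tumor with a radius of 15.0 would fall on the malignant side of the radius midpoint, while one with a radius of 13.0 would fall on the benign side.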

Testing Phase

For a new tumor, we compare each attribute to the corresponding midpoint. If the attribute value is greater than or equal to the midpoint, we cast a vote for malignant. Otherwise, we vote benign. After all ten attributes, if malignant votes >= benign votes, predict malignant; else predict benign.
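The voting rule, including what happens on a tie, can be sketched as follows. The helper name tally_votes, the midpoints, and the tumor values are all hypothetical, and only four attributes are used to keep the example short.

```python
def tally_votes(tumor, midpoints):
    """Return (malignant_votes, benign_votes) for one tumor."""
    malignant_votes = sum(
        1 for attr in midpoints if tumor[attr] >= midpoints[attr]
    )
    benign_votes = len(midpoints) - malignant_votes
    return malignant_votes, benign_votes

# Hypothetical midpoints and tumor (illustrative values only).
midpoints = {'radius': 14.5, 'texture': 20.0, 'area': 650.0, 'symmetry': 0.18}
tumor = {'radius': 15.2, 'texture': 19.1, 'area': 700.0, 'symmetry': 0.17}

m, b = tally_votes(tumor, midpoints)
prediction = 'M' if m >= b else 'B'   # ties go to malignant
print(m, b, prediction)  # 2 2 M
```

Because the number of attributes is even, ties are possible (5 to 5 with all ten attributes). The rule above resolves them as malignant, which is arguably the more cautious choice for a screening tool.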

Data Structure: Lists of Dictionaries

The training data is provided as a list of dictionaries. Each dictionary represents one tumor with keys: ID, radius, texture, perimeter, area, smoothness, compactness, concavity, concave, symmetry, fractal, and class. The keys concave and fractal are short for concave points and fractal dimension. For example:

{
    'ID': 897880,
    'radius': 10.05,
    'texture': 17.53,
    'perimeter': 64.41,
    'area': 310.8,
    'smoothness': 0.1007,
    'compactness': 0.07326,
    'concavity': 0.02511,
    'concave': 0.01775,
    'symmetry': 0.189,
    'fractal': 0.06331,
    'class': 'B'
}

We'll use the attribute names (excluding ID and class) as keys to compute averages.
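One way to get those attribute names is to read them from a tumor dictionary rather than typing them out. The sketch below reuses the example record shown above; deriving the list this way is an illustration, not a required part of the project.

```python
tumor = {
    'ID': 897880, 'radius': 10.05, 'texture': 17.53, 'perimeter': 64.41,
    'area': 310.8, 'smoothness': 0.1007, 'compactness': 0.07326,
    'concavity': 0.02511, 'concave': 0.01775, 'symmetry': 0.189,
    'fractal': 0.06331, 'class': 'B',
}

# Every key except the identifier and the label is a numeric attribute.
attributes = [key for key in tumor if key not in ('ID', 'class')]

print(len(attributes))   # 10
print(attributes[0])     # radius
print(tumor['class'])    # B
```

Dictionaries keep their insertion order in Python 3.7 and later, so the attributes come out in the same order as they appear in the record.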

Building the Classifier Step by Step

Step 1: Load and Split Data

Assume we have a function make_training_set() that returns the training set as a list of dictionaries, and that the test set is loaded the same way. (The complete script at the end uses a single load_data(filename) function for both.)
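If you start from a single list of labeled tumors rather than two prepared sets, the 80/20 split described earlier could be done along these lines. This is a minimal sketch: split_data, its parameters, and the stand-in records are hypothetical names introduced here for illustration, not part of the original assignment.

```python
import random

def split_data(tumors, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and split it into (training_set, test_set)."""
    shuffled = list(tumors)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in records; real tumors would carry all ten attributes.
tumors = [{'ID': i, 'class': 'M' if i % 3 == 0 else 'B'} for i in range(10)]

training_set, test_set = split_data(tumors)
print(len(training_set), len(test_set))  # 8 2
```

Shuffling before cutting matters when the source list is ordered, for example with all malignant records first. Without it, the training set could end up with too few tumors of one class.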

Step 2: Compute Averages

We'll iterate over the training set, separate tumors by class, and keep a running sum of each attribute plus a count of tumors in each class. Dividing each sum by its class count gives the malignant and benign averages.

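# The ten attribute names used as dictionary keys throughout the steps below
attributes = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
              'compactness', 'concavity', 'concave', 'symmetry', 'fractal']

def compute_averages(training_set):
    # Running sums per attribute, kept separately for each class
    malignant_sums = {attr: 0 for attr in attributes}
    benign_sums = {attr: 0 for attr in attributes}
    malignant_count = 0
    benign_count = 0

    for tumor in training_set:
        if tumor['class'] == 'M':
            malignant_count += 1
            for attr in attributes:
                malignant_sums[attr] += tumor[attr]
        else:  # any tumor not labeled 'M' is treated as benign
            benign_count += 1
            for attr in attributes:
                benign_sums[attr] += tumor[attr]

    # Averages are undefined if either class is absent from the training set
    if malignant_count == 0 or benign_count == 0:
        raise ValueError('training set must contain both malignant and benign tumors')

    malignant_avg = {attr: malignant_sums[attr] / malignant_count for attr in attributes}
    benign_avg = {attr: benign_sums[attr] / benign_count for attr in attributes}
    return malignant_avg, benign_avg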

Step 3: Compute Midpoints

For each attribute, the midpoint is the average of the two class averages.

def compute_midpoints(malignant_avg, benign_avg):
    midpoints = {}
    for attr in attributes:
        midpoints[attr] = (malignant_avg[attr] + benign_avg[attr]) / 2
    return midpoints

Step 4: Classify a Single Tumor

Given a tumor dictionary and the midpoints, count the malignant and benign votes and return the winning class, with ties going to malignant.

def classify_tumor(tumor, midpoints):
    malignant_votes = 0
    benign_votes = 0
    for attr in attributes:
        if tumor[attr] >= midpoints[attr]:
            malignant_votes += 1
        else:
            benign_votes += 1
    return 'M' if malignant_votes >= benign_votes else 'B'

Step 5: Evaluate on Test Set

Run the classifier on each tumor in the test set and compute accuracy.

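def evaluate(test_set, midpoints):
    # Accuracy is undefined for an empty test set
    if not test_set:
        raise ValueError('test set is empty')
    correct = 0
    for tumor in test_set:
        prediction = classify_tumor(tumor, midpoints)
        if prediction == tumor['class']:
            correct += 1
    # Fraction of test tumors whose predicted class matches the label
    return correct / len(test_set)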
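Accuracy is a single number, so it hides which kind of mistake the classifier makes. As an optional extension, the sketch below tallies each (actual, predicted) pair. The helper count_outcomes and the sample labels and predictions are hypothetical and are not part of the steps above.

```python
def count_outcomes(test_set, predictions):
    """Tally predictions by (actual, predicted) pair."""
    outcomes = {('M', 'M'): 0, ('M', 'B'): 0, ('B', 'M'): 0, ('B', 'B'): 0}
    for tumor, predicted in zip(test_set, predictions):
        outcomes[(tumor['class'], predicted)] += 1
    return outcomes

# Hypothetical labels and predictions for five test tumors.
test_set = [{'class': 'M'}, {'class': 'M'}, {'class': 'B'},
            {'class': 'B'}, {'class': 'B'}]
predictions = ['M', 'B', 'B', 'B', 'M']

outcomes = count_outcomes(test_set, predictions)
accuracy = (outcomes[('M', 'M')] + outcomes[('B', 'B')]) / len(test_set)

print(outcomes[('M', 'B')])  # 1  (malignant tumors predicted benign)
print(accuracy)              # 0.6
```

In a medical setting the ('M', 'B') count, malignant tumors predicted as benign, is usually the error worth watching most closely.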

Why This Matters: Machine Learning in the Real World

This simple voting classifier is a foundation for more advanced algorithms like decision trees and k-nearest neighbors. In 2026, machine learning is used everywhere: from AI-powered cancer diagnosis to recommendation systems on streaming platforms. The core idea—using historical data to make predictions—is the same. For example, a fantasy football app might use player stats to predict weekly scores, or a finance app might predict stock trends. By mastering this project, you're building skills that apply to countless real-world problems.

Trending Example: Cancer Classification in the Age of AI

Imagine you're building a tool for a hospital. With the rise of AI in healthcare, even simple classifiers can help doctors prioritize cases. For instance, a triage app (call it 'MediPredict') could use similar logic to flag high-risk patients. While our classifier is basic, it demonstrates the core of the pipeline: data preparation, training, and testing, the stages that come before any deployment. As you advance, you'll learn to use libraries like scikit-learn to build more accurate models, but the fundamentals remain the same.

Complete Code Example

Here's a full Python script that ties it all together:

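import csv

attributes = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
              'compactness', 'concavity', 'concave', 'symmetry', 'fractal']

def load_data(filename):
    # ASSUMPTION: the tutorial does not specify the file format. This sketch
    # expects comma-separated lines with no header row: ID, class, then the
    # ten attributes in the order listed above. Adjust it to match your files.
    tumors = []
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            tumor = {'ID': int(row[0]), 'class': row[1].strip()}
            for attr, value in zip(attributes, row[2:]):
                tumor[attr] = float(value)
            tumors.append(tumor)
    return tumors

def compute_averages(training_set):
    # Same logic as Step 2: per-class sums and counts, then averages
    malignant_sums = {attr: 0 for attr in attributes}
    benign_sums = {attr: 0 for attr in attributes}
    malignant_count = benign_count = 0
    for tumor in training_set:
        if tumor['class'] == 'M':
            malignant_count += 1
            for attr in attributes:
                malignant_sums[attr] += tumor[attr]
        else:
            benign_count += 1
            for attr in attributes:
                benign_sums[attr] += tumor[attr]
    malignant_avg = {attr: malignant_sums[attr] / malignant_count for attr in attributes}
    benign_avg = {attr: benign_sums[attr] / benign_count for attr in attributes}
    return malignant_avg, benign_avg

def compute_midpoints(mal_avg, ben_avg):
    return {attr: (mal_avg[attr] + ben_avg[attr]) / 2 for attr in attributes}

def classify_tumor(tumor, midpoints):
    # Same logic as Step 4: one vote per attribute, ties go to malignant
    malignant_votes = sum(1 for attr in attributes if tumor[attr] >= midpoints[attr])
    return 'M' if malignant_votes >= len(attributes) - malignant_votes else 'B'

def evaluate(test_set, midpoints):
    # Same logic as Step 5: fraction of test tumors classified correctly
    correct = sum(1 for tumor in test_set if classify_tumor(tumor, midpoints) == tumor['class'])
    return correct / len(test_set)

def main():
    training_set = load_data('cancerTrainingData.txt')
    test_set = load_data('cancerTestData.txt')
    mal_avg, ben_avg = compute_averages(training_set)
    midpoints = compute_midpoints(mal_avg, ben_avg)
    accuracy = evaluate(test_set, midpoints)
    print(f'Accuracy: {accuracy:.2%}')

if __name__ == '__main__':
    main()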

Conclusion

You've built a rule-based cancer classifier using Python lists and dictionaries. This project gives you hands-on experience with machine learning fundamentals: training, testing, and a voting mechanism. As you continue, explore how to improve accuracy with weighted votes or more sophisticated algorithms. The skills you've practiced—data manipulation, averaging, and classification—are essential for any data scientist or machine learning engineer.

Key Takeaway: Machine learning isn't magic. It's about using data to make informed guesses. By splitting data into training and test sets, you can evaluate how well your model generalizes to new cases. This simple classifier is your first step into a world of predictive modeling.