CS7280 Assignment 2 Pandas Tutorial: Data Wrangling in 2026

Why Pandas Skills Matter in 2026

Data wrangling underpins most data work in 2026, from preparing terabytes of user interactions for AI model training to building sports analytics that predict game outcomes. CS7280 Assignment 2 tests your ability to clean, transform, and analyze datasets using pandas. This tutorial walks you through the core operations you need, using examples inspired by current trends like the 2026 FIFA World Cup qualifiers and the latest AI chatbot benchmarks.

Understanding the Assignment Structure

The assignment provides a Jupyter notebook (A2.ipynb) and a data/ directory with one or more files. Your task is to complete the notebook by writing functions that perform specific data wrangling tasks. The key is to write clean, efficient code and to avoid adding extra cells that might confuse the autograder. Let's break down the typical tasks you'll encounter.

Core Pandas Operations You Must Know

1. Loading and Inspecting Data

First, you'll load your data using pd.read_csv() or pd.read_excel(). For example, if you have a CSV file with World Cup qualifier match data:

import pandas as pd
df = pd.read_csv('data/matches.csv')
df.head()

Always inspect your data with .info(), .describe(), and .isnull().sum() to understand missing values and data types.
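Here is a minimal sketch of that inspection routine. Because the contents of data/matches.csv are not shown here, it builds a small stand-in DataFrame in memory; the team names, column names, and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for data/matches.csv; the real columns may differ.
matches = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team C", "Team D"],
    "away_team": ["Team E", "Team F", "Team G", "Team H"],
    "home_goals": [3, 1, 2, 0],
    "away_goals": [1, 1, 0, 2],
    "attendance": [54000.0, None, 31000.0, None],
})

matches.info()                    # prints dtypes and non-null counts per column
summary = matches.describe()      # count, mean, std, quartiles for numeric columns
missing = matches.isnull().sum()  # number of missing values in each column
print(missing)
```

Reading the three outputs together tells you which columns need type conversion and which need a missing-value strategy before you start writing the graded functions.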

2. Filtering and Selecting Data

You'll often need to filter rows based on conditions. For instance, to select only matches where the home team scored more than 2 goals:

high_scoring = df[df['home_goals'] > 2]

Remember to use .loc and .iloc for label-based and integer-based indexing, respectively.
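To make the difference between the two indexers concrete, here is a small sketch on an invented three-row table with string row labels (the labels m1 to m3 and the values are assumptions for illustration).

```python
import pandas as pd

matches = pd.DataFrame(
    {"home_team": ["Team A", "Team B", "Team C"],
     "home_goals": [3, 1, 4]},
    index=["m1", "m2", "m3"],
)

# .loc is label-based: a boolean mask picks rows, a column label picks the column
high_scorers = matches.loc[matches["home_goals"] > 2, "home_team"]

# .iloc is position-based: first row, second column
first_goals = matches.iloc[0, 1]

# Label slices include the end point; position slices exclude it
by_label = matches.loc["m1":"m3"]   # rows m1, m2, m3
by_position = matches.iloc[0:2]     # rows at positions 0 and 1 only
```

The slice behavior is the detail that most often causes off-by-one surprises when switching between the two.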

3. Handling Missing Data

Real-world datasets are messy. You might have missing values in columns like attendance or referee. Use df.dropna() when rows with gaps can safely be discarded, and df.fillna() when a sensible substitute value exists. For example, fill missing attendance with the median:

df['attendance'] = df['attendance'].fillna(df['attendance'].median())
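The two strategies can be compared side by side on a toy table. This is only a sketch: the referee and attendance values are invented, and which strategy is appropriate depends on what the assignment asks for.

```python
import pandas as pd

matches = pd.DataFrame({
    "referee": ["R1", None, "R3", "R4"],
    "attendance": [40000.0, None, 30000.0, 50000.0],
})

# Option 1: drop rows that lack a value in one specific column
with_referee = matches.dropna(subset=["referee"])

# Option 2: fill numeric gaps with the column median (computed ignoring NaN)
median_attendance = matches["attendance"].median()
filled = matches.assign(
    attendance=matches["attendance"].fillna(median_attendance)
)
```

Both calls return new objects and leave matches unchanged, which makes it easy to compare the results before committing to one approach.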

4. Grouping and Aggregating

Grouping is essential for summarizing data. Suppose you want the average goals per team across all matches. First, reshape the data to have one row per team per match using pd.melt() or by concatenating home and away data, then group:

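# Stack home and away records under shared column names: one row per team per match
home_df = df[['home_team', 'home_goals']].rename(columns={'home_team':'team', 'home_goals':'goals'})
away_df = df[['away_team', 'away_goals']].rename(columns={'away_team':'team', 'away_goals':'goals'})
# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of duplicate labels
all_goals = pd.concat([home_df, away_df], ignore_index=True)
avg_goals = all_goals.groupby('team')['goals'].mean()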
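The pd.melt() route mentioned above could look something like the following sketch. It assumes the same four column names as the concatenation example and uses invented scores; it melts the team columns and the goal columns separately and relies on both results coming back in the same row order.

```python
import pandas as pd

matches = pd.DataFrame({
    "home_team": ["Team A", "Team B"],
    "away_team": ["Team B", "Team C"],
    "home_goals": [2, 0],
    "away_goals": [1, 3],
})

# Each melt stacks the home column on top of the away column,
# so row i of `teams` lines up with row i of `goals`.
teams = pd.melt(matches, value_vars=["home_team", "away_team"], value_name="team")
goals = pd.melt(matches, value_vars=["home_goals", "away_goals"], value_name="goals")

all_goals = pd.DataFrame({"team": teams["team"], "goals": goals["goals"]})
avg_goals = all_goals.groupby("team")["goals"].mean()
print(avg_goals)
```

For a table this simple the concatenation approach is arguably easier to read; melt tends to pay off when there are many paired columns to unpivot.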

5. Merging and Joining Datasets

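You may need to combine multiple files, for example match data with a separate file of team rankings. Use pd.merge() to join on a shared key. When the key column has a different name in each file (here, home_team in the match data and team in the rankings), pass left_on and right_on: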

rankings = pd.read_csv('data/rankings.csv')
df_merged = pd.merge(df, rankings, left_on='home_team', right_on='team', how='left')
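A left join silently leaves NaN where no match is found, so it can help to check the result. The sketch below uses invented match and ranking tables (the rank column name is an assumption) together with the validate and indicator options of pd.merge().

```python
import pandas as pd

matches = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team D"],
    "home_goals": [2, 0, 1],
})
rankings = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "rank": [1, 2, 3],
})

merged = pd.merge(
    matches, rankings,
    left_on="home_team", right_on="team",
    how="left",
    validate="many_to_one",  # raises if a team appears more than once in rankings
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Home teams that had no row in the rankings table
unmatched = merged.loc[merged["_merge"] == "left_only", "home_team"].tolist()
print(unmatched)
```

If unmatched is not empty, the usual culprits are inconsistent spellings or stray whitespace in the key column.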

Trend-Inspired Example: Analyzing AI Chatbot Performance

Imagine you have a dataset of chatbot responses from the latest LLM benchmark (like the 2026 Chatbot Arena). Each row contains the model name, prompt category, response length, and a human rating. Your task: find the average rating per model per category. This is a classic groupby operation. Let's simulate:

data = {'model': ['GPT-6', 'Claude-4', 'Gemini-3', 'GPT-6', 'Claude-4'],
        'category': ['math', 'math', 'math', 'creative', 'creative'],
        'rating': [4.5, 4.7, 4.3, 4.8, 4.6]}
ratings_df = pd.DataFrame(data)  # separate name so the match data in df is not overwritten
avg_rating = ratings_df.groupby(['model', 'category'])['rating'].mean()
print(avg_rating)

This mirrors the type of aggregation you'll do in the assignment.
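The grouped result is a Series with a two-level index, which is not always the shape a grader or a later step expects. As a sketch, using the same simulated ratings, here are two common ways to reshape it; which one you need depends on the function's required return type.

```python
import pandas as pd

ratings = pd.DataFrame({
    "model": ["GPT-6", "Claude-4", "Gemini-3", "GPT-6", "Claude-4"],
    "category": ["math", "math", "math", "creative", "creative"],
    "rating": [4.5, 4.7, 4.3, 4.8, 4.6],
})

avg_rating = ratings.groupby(["model", "category"])["rating"].mean()

# Flat DataFrame: one row per (model, category) pair, with ordinary columns
flat = avg_rating.reset_index()

# Wide table: models as rows, categories as columns; absent pairs become NaN
wide = avg_rating.unstack("category")
print(wide)
```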

Common Pitfalls and How to Avoid Them

Not handling data types: Ensure columns like dates are parsed as datetime using pd.to_datetime().
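Using chained indexing: Avoid df[df['col'] > 0]['col2'], especially when assigning values, because the write may land on a temporary copy; use df.loc[df['col'] > 0, 'col2'] instead.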
Forgetting to reset index: After groupby, use .reset_index() to get a flat DataFrame.
Adding extra cells: The autograder expects only the provided cells and your utility function. Delete any test cells before submission.
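The first two pitfalls can be illustrated on a toy table. This is a sketch with invented dates and column names, not the assignment's actual data.

```python
import pandas as pd

matches = pd.DataFrame({
    "date": ["2025-09-04", "2025-09-09", "not a date"],
    "home_goals": [3, 0, 2],
    "result": ["", "", ""],
})

# errors="coerce" turns unparseable strings into NaT instead of raising an error
matches["date"] = pd.to_datetime(matches["date"], errors="coerce")

# One .loc call with the row mask and the column label together,
# so the assignment modifies `matches` itself rather than a temporary copy
matches.loc[matches["home_goals"] > 2, "result"] = "big win"
print(matches)
```

After the conversion, the date column supports the .dt accessor (for example matches["date"].dt.year), which plain strings do not.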

Step-by-Step Workflow for the Assignment

1. Unzip the file and open A2.ipynb in Jupyter.
2. Read the instructions for each cell. They often ask you to define a function.
3. Write a utility function if needed (e.g., a function to clean a column). Place it in the designated section.
4. Test locally by running all cells. Make sure no errors occur.
5. Clean up by deleting any cells you added (except the utility function cell).
6. Submit the notebook as is.
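As an illustration of the kind of utility function mentioned in the workflow, here is a hypothetical column cleaner. The function name clean_text_column and the cleaning rules (trim whitespace, title-case) are assumptions for this sketch; the assignment may ask for something different.

```python
import pandas as pd

def clean_text_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with `column` trimmed of whitespace and title-cased."""
    out = df.copy()  # work on a copy so the caller's DataFrame is not modified
    out[column] = out[column].str.strip().str.title()
    return out

# Example usage on invented, inconsistently formatted team names
raw = pd.DataFrame({"team": ["  team a", "TEAM B ", "Team c"]})
cleaned = clean_text_column(raw, "team")
print(cleaned["team"].tolist())
```

Returning a new DataFrame instead of mutating the input keeps the function safe to call repeatedly when you rerun notebook cells.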

Conclusion

Data wrangling with pandas is a fundamental skill for any data scientist or software engineer. By mastering these operations, you'll not only ace CS7280 Assignment 2 but also be prepared for real-world data challenges. Remember to keep your code clean and your notebook free of extra cells. Good luck!