Mastering Data Wrangling with Pandas: A Practical Guide for CS7280 Assignment 2
Learn essential pandas techniques for data cleaning, transformation, and analysis to ace your CS7280 Assignment 2. This tutorial covers real-world examples with timely datasets.
Why Pandas Skills Matter in 2026
As we move through 2026, data is the new gold. From AI models training on terabytes of user interactions to sports analytics predicting game outcomes, the ability to wrangle data efficiently is a superpower. In CS7280 Assignment 2, you'll be tested on your ability to clean, transform, and analyze datasets using pandas. This tutorial will walk you through the core concepts you need, using examples inspired by current trends like the 2026 FIFA World Cup qualifiers and the latest AI chatbot benchmarks.
Understanding the Assignment Structure
The assignment provides a Jupyter notebook (A2.ipynb) and a data/ directory with one or more files. Your task is to complete the notebook by writing functions that perform specific data wrangling tasks. The key is to write clean, efficient code and to avoid adding extra cells that might confuse the autograder. Let's break down the typical tasks you'll encounter.
Core Pandas Operations You Must Know
1. Loading and Inspecting Data
First, you'll load your data using pd.read_csv() or pd.read_excel(). For example, if you have a CSV file with World Cup qualifier match data:
import pandas as pd
df = pd.read_csv('data/matches.csv')
df.head()
Always inspect your data with .info(), .describe(), and .isnull().sum() to understand missing values and data types.
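The inspection calls above can be sketched on a tiny in-memory table. The CSV text and column names below are hypothetical stand-ins for data/matches.csv, so treat this as an illustration rather than the assignment's actual schema.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/matches.csv; the real file's columns may differ.
csv_text = """home_team,away_team,home_goals,away_goals,attendance
Team A,Team B,3,1,45000
Team C,Team D,2,2,
Team E,Team F,1,0,18000
"""
matches = pd.read_csv(io.StringIO(csv_text))

matches.info()                    # prints column dtypes and non-null counts
summary = matches.describe()      # summary statistics for the numeric columns
missing = matches.isnull().sum()  # number of missing values in each column

print(summary)
print(missing)
```

Here the empty attendance field in the second row is read as a missing value, so it shows up in the per-column missing count.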
2. Filtering and Selecting Data
You'll often need to filter rows based on conditions. For instance, to select only matches where the home team scored more than 2 goals:
high_scoring = df[df['home_goals'] > 2]
Remember to use .loc and .iloc for label-based and integer-based indexing, respectively.
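As a minimal sketch of the difference between .loc and .iloc, assuming a small made-up matches table (the team names and scores are illustrative only):

```python
import pandas as pd

# Made-up data for illustration; not taken from the assignment files.
matches = pd.DataFrame({
    "home_team": ["A", "B", "C", "D"],
    "home_goals": [3, 1, 4, 2],
    "away_goals": [0, 1, 2, 2],
})

# .loc: boolean row mask plus column labels in a single call
high_scoring = matches.loc[matches["home_goals"] > 2, ["home_team", "home_goals"]]

# .iloc: first two rows and first two columns, selected by integer position
top_left = matches.iloc[:2, :2]

# Combine conditions with & or |, wrapping each condition in parentheses
big_home_wins = matches.loc[(matches["home_goals"] > 2) & (matches["away_goals"] == 0)]

print(high_scoring)
print(top_left)
print(big_home_wins)
```

Passing the row mask and the column list to one .loc call keeps the selection in a single step, which also makes later assignments through the same expression safe.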
3. Handling Missing Data
Real-world datasets are messy. You might have missing values in columns like attendance or referee. Use df.dropna() or df.fillna() wisely. For example, fill missing attendance with the median:
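# Assign the result back: fillna(..., inplace=True) on a selected column is chained
# assignment and may leave df unchanged in recent pandas versions.
df['attendance'] = df['attendance'].fillna(df['attendance'].median())
4. Grouping and Aggregating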
Grouping is essential for summarizing data. Suppose you want the average goals per team across all matches. First, reshape the data to have one row per team per match using pd.melt() or by concatenating home and away data, then group:
home_df = df[['home_team', 'home_goals']].rename(columns={'home_team':'team', 'home_goals':'goals'})
away_df = df[['away_team', 'away_goals']].rename(columns={'away_team':'team', 'away_goals':'goals'})
all_goals = pd.concat([home_df, away_df])
avg_goals = all_goals.groupby('team')['goals'].mean()
5. Merging and Joining Datasets
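You may need to combine multiple files, for example the match data with a separate file of team rankings. Use pd.merge() to join on a shared key column. When the key has a different name in each table (home_team in the matches, team in the rankings), pass left_on and right_on: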
rankings = pd.read_csv('data/rankings.csv')
df_merged = pd.merge(df, rankings, left_on='home_team', right_on='team', how='left')
Trend-Inspired Example: Analyzing AI Chatbot Performance
Imagine you have a dataset of chatbot responses from the latest LLM benchmark (like the 2026 Chatbot Arena). Each row contains the model name, prompt category, response length, and a human rating. Your task: find the average rating per model per category. This is a classic groupby operation. Let's simulate:
data = {'model': ['GPT-6', 'Claude-4', 'Gemini-3', 'GPT-6', 'Claude-4'],
        'category': ['math', 'math', 'math', 'creative', 'creative'],
        'rating': [4.5, 4.7, 4.3, 4.8, 4.6]}
df = pd.DataFrame(data)
avg_rating = df.groupby(['model', 'category'])['rating'].mean()
print(avg_rating)
This mirrors the type of aggregation you'll do in the assignment.
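If a task asks for a differently shaped summary, the same simulated ratings can be reshaped or aggregated further. The sketch below is one possible approach, not a required part of the assignment. It uses unstack() for a model-by-category table and named aggregation for a per-model summary, and the output column names mean_rating and n_ratings are chosen here for illustration.

```python
import pandas as pd

# Same simulated ratings as in the example above.
ratings = pd.DataFrame({
    "model": ["GPT-6", "Claude-4", "Gemini-3", "GPT-6", "Claude-4"],
    "category": ["math", "math", "math", "creative", "creative"],
    "rating": [4.5, 4.7, 4.3, 4.8, 4.6],
})

avg_rating = ratings.groupby(["model", "category"])["rating"].mean()

# Wide table: one row per model, one column per category.
# Model/category pairs with no data become NaN.
wide = avg_rating.unstack("category")

# Named aggregation: several statistics per model in one pass.
per_model = ratings.groupby("model").agg(
    mean_rating=("rating", "mean"),
    n_ratings=("rating", "count"),
)

print(wide)
print(per_model)
```

The wide layout makes gaps visible at a glance: the simulated data has no creative rating for Gemini-3, so that cell is NaN.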
Common Pitfalls and How to Avoid Them
- Not handling data types: Ensure columns like dates are parsed as datetime using pd.to_datetime().
- Using chained indexing: Avoid df[df['col'] > 0]['col2']; use .loc instead.
- Forgetting to reset index: After groupby, use .reset_index() to get a flat DataFrame.
- Adding extra cells: The autograder expects only the provided cells and your utility function. Delete any test cells before submission.
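The first three pitfalls can be illustrated with a short sketch. The table below is made up, and the home_scored column is a hypothetical name added only to show assignment through .loc.

```python
import pandas as pd

# Made-up data for illustration; dates and values are not from the assignment.
matches = pd.DataFrame({
    "date": ["2025-03-20", "2025-03-25", "2025-06-05"],
    "home_team": ["A", "B", "A"],
    "home_goals": [3, 0, 2],
})

# 1. Parse date strings so comparisons and the .dt accessor work
matches["date"] = pd.to_datetime(matches["date"])

# 2. One .loc call for the row mask and the column, instead of df[mask]["col"] = ...
matches["home_scored"] = False
matches.loc[matches["home_goals"] > 0, "home_scored"] = True

# 3. reset_index() turns the grouped result back into a flat DataFrame
goals_by_team = matches.groupby("home_team")["home_goals"].sum().reset_index()

print(matches.dtypes)
print(goals_by_team)
```

Without the reset_index() call, home_team would be the index of a Series rather than an ordinary column, which tends to matter when a function is expected to return a DataFrame with specific column names.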
Step-by-Step Workflow for the Assignment
- Unzip the file and open A2.ipynb in Jupyter.
- Read the instructions for each cell. They often ask you to define a function.
- Write a utility function if needed (e.g., a function to clean a column). Place it in the designated section.
- Test locally by running all cells. Make sure no errors occur.
- Clean up by deleting any cells you added (except the utility function cell).
- Submit the notebook as is.
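As a rough idea of what a column-cleaning utility function might look like, here is a minimal sketch. The function name clean_text_column, its behavior, and the referee example data are all assumptions made for illustration; the actual notebook may ask for something different.

```python
import pandas as pd

def clean_text_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df in which `column` has
    surrounding whitespace stripped and empty strings turned into missing values."""
    out = df.copy()                      # leave the caller's DataFrame untouched
    cleaned = out[column].str.strip()    # missing entries stay missing
    out[column] = cleaned.where(cleaned != "")  # empty strings become missing
    return out

# Made-up example input
raw = pd.DataFrame({"referee": ["  Smith ", "", None, "Lee"]})
tidy = clean_text_column(raw, "referee")
print(tidy)
```

Returning a copy rather than modifying the input in place keeps the function safe to call repeatedly while you test locally, since rerunning a cell does not change the data it started from.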
Conclusion
Data wrangling with pandas is a fundamental skill for any data scientist or software engineer. By mastering these operations, you'll not only ace CS7280 Assignment 2 but also be prepared for real-world data challenges. Remember to keep your code clean and your notebook free of extra cells. Good luck!