Machine Vision Assignment II: Counting Pushups with Video Understanding (2026)

Introduction: Why Counting Pushups Is Harder Than It Looks

In May 2026, AI video generation models like Veo-3 and Sora can create stunningly realistic clips, but understanding what's happening in a video remains a major challenge. Counting pushups in a video requires the model to recognize repeated motion, detect body landmarks, and track cycles over time. This assignment asks you to build a model that takes an .mp4 video and outputs an integer: the number of pushups completed. You are encouraged to use AI tools (like Gemini in Colab) to brainstorm and code, but your project report must be your own work. This tutorial will guide you through the key concepts and practical steps to design and train a video-understanding model.

Understanding the Problem: Video Dynamics and Repetition Counting

Unlike image classification, video understanding demands temporal reasoning. A pushup is a cyclic motion: the body lowers and rises. Your model must count how many times this cycle occurs. State-of-the-art models still struggle with such dynamics, so your approach matters more than perfect accuracy. Think of it like analyzing a basketball player's jump shots — you need to detect each jump, not just classify a pose.

Key Challenges

Variability: Different camera angles, lighting, clothing, and body types.
Temporal boundaries: When does a pushup start and end? Should you count half pushups?
Speed: Fast vs slow pushups affect frame sampling.
Partial visibility: The person may move out of frame temporarily.

Approach 1: Pose Estimation + Temporal Analysis

One popular method uses a pose estimator (e.g., MediaPipe or OpenPose) to extract keypoints (shoulders, elbows, wrists, hips) from each frame. You then analyze the vertical position of the shoulder or hip over time. Each pushup shows up as one trough in body height (the bottom of the rep) followed by one peak (the return to the top). Note that image y-coordinates grow downward, so the bottom of a rep is a local maximum of the raw y-value; either flip the sign or keep this in mind when picking extrema. This is like tracking stock market cycles: you count peaks and troughs. To suppress noise, smooth the signal first with a low-pass filter, a simple moving average, or a median filter, then count the completed cycles. This approach is lightweight and interpretable.

Implementation Steps

Extract frames from the video at a fixed rate (e.g., 15 fps).
Run pose estimation on each frame to get shoulder y-coordinates.
Normalize coordinates: divide by image height to handle different resolutions, and optionally by a body-scale measure (e.g., shoulder-to-hip distance) to handle different camera distances.
Apply a median filter to remove outliers.
Detect local minima and maxima using a window-based approach.
Count each descent-ascent pair as one pushup.
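
The smoothing and counting steps above can be sketched on a 1D shoulder trace. This is a minimal sketch under stated assumptions, not a reference solution: instead of explicit local-extrema detection it swaps in two thresholds (hysteresis), and the names median_filter and count_pushups, the 30%/70% thresholds, and min_range are all hypothetical choices to tune on your own data. It assumes the pose step has already produced normalized y-coordinates.

```python
import numpy as np


def median_filter(signal, kernel=5):
    """Sliding median with edge padding; kernel must be odd."""
    pad = kernel // 2
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel)
    return np.median(windows, axis=1)


def count_pushups(shoulder_y, kernel=5, min_range=0.05):
    """Count down-then-up cycles in a normalized shoulder y-trace.

    shoulder_y uses image coordinates, so larger values mean the
    shoulder is lower in the frame. min_range and the 30%/70%
    thresholds are illustrative assumptions, not tuned values.
    """
    y = median_filter(np.asarray(shoulder_y, dtype=float), kernel)
    span = y.max() - y.min()
    if span < min_range:  # too little motion to be a pushup
        return 0
    enter_down = y.min() + 0.7 * span  # above this: at the bottom of a rep
    enter_up = y.min() + 0.3 * span    # below this: back at the top

    count, is_down = 0, False
    for value in y:
        if not is_down and value > enter_down:
            is_down = True
        elif is_down and value < enter_up:
            is_down = False
            count += 1  # one full descent-ascent pair
    return count


# Synthetic check: five smooth reps plus a little keypoint jitter.
rng = np.random.default_rng(0)
t = np.linspace(0, 5, 150, endpoint=False)
trace = 0.5 - 0.1 * np.cos(2 * np.pi * t) + rng.normal(0, 0.005, t.size)
print(count_pushups(trace))
```

Because a rep only counts after the signal crosses both thresholds, small jitter near either threshold does not add reps. The trade-off is that very shallow pushups may never reach the "down" threshold, which is one way to decide whether half pushups count.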

Approach 2: 3D Convolutional Neural Network (3D CNN)

For a more end-to-end solution, you can train a 3D CNN on short video clips. Each clip is a 4D tensor: frames × height × width × channels. The model learns spatiotemporal features directly from pixels, with no separate pose step. This is loosely analogous to video generators such as Sora, which must also model motion patterns. However, training from scratch requires a large dataset. Instead, you can use transfer learning: start from a model pre-trained on an action recognition dataset such as Kinetics-400 and fine-tune it on your pushup dataset. Architectures such as I3D or C3D are common starting points; keep clips short and batch sizes small so training fits in Colab's GPU memory.
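
As a small illustration of the tensor layout, the sketch below converts one clip from frames × height × width × channels into the batch-first, channels-first layout (N, C, T, H, W) that PyTorch's 3D convolutions expect. The function name to_model_input and the simple divide-by-255 scaling are assumptions; a pre-trained model may require its own normalization.

```python
import numpy as np


def to_model_input(clip):
    """Convert a (T, H, W, C) uint8 clip to (1, C, T, H, W) float32 in [0, 1]."""
    clip = np.asarray(clip, dtype=np.float32) / 255.0
    clip = np.transpose(clip, (3, 0, 1, 2))  # move channels in front of time
    return clip[np.newaxis]                  # add a batch dimension


# A dummy 16-frame clip at 112x112 resolution.
dummy = np.zeros((16, 112, 112, 3), dtype=np.uint8)
print(to_model_input(dummy).shape)
```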

Data Preparation

Collect or generate pushup videos. Since no test set is provided, build your own held-out set by recording yourself or others, and use augmentation to expand the training data.
Label each clip with the number of pushups (regression) or frame-level labels (e.g., 'up' vs 'down' for a classifier that counts transitions).
Use a sliding window to create overlapping clips.
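
The sliding-window step might look like the following sketch. The name make_clips and the defaults (16-frame clips, stride 8, so consecutive clips overlap by half) are assumptions for illustration, not values required by the assignment.

```python
import numpy as np


def make_clips(frames, clip_len=16, stride=8):
    """Split a (T, H, W, C) array into overlapping clips of clip_len frames."""
    frames = np.asarray(frames)
    starts = range(0, len(frames) - clip_len + 1, stride)
    if len(starts) == 0:  # video shorter than one clip
        return np.empty((0, clip_len) + frames.shape[1:], dtype=frames.dtype)
    return np.stack([frames[s:s + clip_len] for s in starts])


video = np.zeros((40, 8, 8, 3), dtype=np.uint8)
print(make_clips(video).shape)
```

Keep in mind that with overlapping clips a single pushup can fall into two windows, so per-clip counts cannot simply be summed to get the count for the whole video.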

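# Example: loading video frames with OpenCV
import cv2
import numpy as np

cap = cv2.VideoCapture('pushup.mp4')
if not cap.isOpened():
    raise IOError('Could not open pushup.mp4')
frames = []
while True:
    ret, frame = cap.read()
    if not ret:  # end of video or read error
        break
    frame = cv2.resize(frame, (224, 224))
    # OpenCV decodes to BGR; most pre-trained models expect RGB
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()
frames = np.stack(frames)  # shape: (num_frames, 224, 224, 3)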
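
The loader above keeps every frame. To sample at a fixed rate such as 15 fps instead, one option is to compute which frame indices to keep from the video's native frame rate. This is a sketch with a hypothetical helper name, sample_indices; the native rate is assumed to come from something like cap.get(cv2.CAP_PROP_FPS).

```python
import math


def sample_indices(num_frames, native_fps, target_fps=15.0):
    """Return the indices of frames to keep when resampling to target_fps."""
    if target_fps >= native_fps:
        return list(range(num_frames))  # cannot upsample; keep everything
    step = native_fps / target_fps      # source frames per kept frame
    num_out = math.floor(num_frames / step)
    return [int(i * step) for i in range(num_out)]


# A 2-second clip recorded at 30 fps, resampled to 15 fps.
kept = sample_indices(60, native_fps=30.0, target_fps=15.0)
print(len(kept), kept[:5])
```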

Approach 3: Transformer-Based Temporal Model

Inspired by large language models, you can treat frame features as tokens. Use a pre-trained 2D CNN (e.g., ResNet50) to extract one feature vector per frame, then feed the sequence into a Transformer encoder. A regression head outputs the count. This mirrors how language models process token sequences. Add positional encodings so the model knows the order of the frames, since self-attention is otherwise order-agnostic. This approach is flexible but requires careful tuning of sequence length, and padding masks when videos differ in length.
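
Two of the pieces mentioned above, positional encoding and masking, can be sketched without any deep learning framework. The sinusoidal scheme below is one common choice rather than a requirement, and the names sinusoidal_positions and pad_sequence are hypothetical. The mask marks padded positions as True; check which convention your framework expects before passing it in.

```python
import numpy as np


def sinusoidal_positions(seq_len, dim):
    """Fixed sinusoidal positional encodings of shape (seq_len, dim); dim must be even."""
    positions = np.arange(seq_len)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((seq_len, dim), dtype=np.float32)
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc


def pad_sequence(features, max_len):
    """Zero-pad or truncate (T, D) features to max_len; mask is True where padded."""
    t = min(len(features), max_len)
    padded = np.zeros((max_len, features.shape[1]), dtype=np.float32)
    padded[:t] = features[:t]
    mask = np.arange(max_len) >= t
    return padded, mask


# 20 frames of 64-dimensional features, padded to a fixed length of 32.
feats = np.ones((20, 64), dtype=np.float32)
tokens, mask = pad_sequence(feats, max_len=32)
tokens = tokens + sinusoidal_positions(32, 64)
print(tokens.shape, int(mask.sum()))
```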

Evaluation and Iteration

Your model's performance will be judged on count error (mean absolute error between predicted and true counts) and robustness. Since the assignment values process over raw performance, document your experiments. For each iteration, note what changed (e.g., frame rate, model architecture, data augmentation) and the effect. Use plots like predicted vs actual counts. This is where your project report shines: show your thinking.
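
Mean absolute error over a set of videos is simple to compute. The sketch below also reports the fraction of videos whose predicted count is within one rep of the truth; that second number is an extra diagnostic assumed here for illustration, not a metric named by the assignment. The function name count_metrics is hypothetical.

```python
import numpy as np


def count_metrics(predicted, actual, tolerance=1):
    """Return MAE and the fraction of predictions within `tolerance` reps."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    errors = np.abs(predicted - actual)
    return {
        "mae": float(errors.mean()),
        "within_tolerance": float((errors <= tolerance).mean()),
    }


# Errors are 0, 2, 1, and 5 reps.
print(count_metrics([10, 12, 7, 20], [10, 10, 8, 25]))
# → {'mae': 2.0, 'within_tolerance': 0.5}
```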

Common Pitfalls

Overcounting: Small jitter in pose keypoints may create false minima. Use temporal smoothing.
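Undercounting: If pushups are fast relative to your sampling rate, a full rep may span only a few frames and be missed or smoothed away. Sample more frames per second or shorten the smoothing window.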
Model overfitting: If you only train on one person, it may not generalize. Use data augmentation (flip, crop, brightness).
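
The flip, crop, and brightness augmentations can be sketched in plain NumPy. The key point for video is that the same random transform is applied to every frame of a clip, so the motion stays consistent. The function name augment_clip and the default ranges are assumptions for illustration.

```python
import numpy as np


def augment_clip(clip, rng, crop_frac=0.9, max_brightness=0.2):
    """Apply one random flip, crop, and brightness change to a (T, H, W, C) uint8 clip."""
    _, h, w, _ = clip.shape
    out = clip
    if rng.random() < 0.5:
        out = out[:, :, ::-1]  # horizontal flip along the width axis

    # Random crop, identical for all frames.
    ch, cw = int(round(h * crop_frac)), int(round(w * crop_frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[:, top:top + ch, left:left + cw]

    # Global brightness scaling, clipped back to the valid pixel range.
    scale = 1.0 + rng.uniform(-max_brightness, max_brightness)
    return np.clip(out.astype(np.float32) * scale, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(8, 40, 40, 3), dtype=np.uint8)
print(augment_clip(clip, rng).shape)
```

The crop changes the spatial size, so resize the result back to the model's input resolution afterwards. None of these transforms changes the pushup count, so the label stays the same, provided the crop does not cut the person out of the frame.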

Using AI Tools Responsibly

You are allowed to use LLMs like Gemini to help write code. For example, you can ask: "Write a Python function to detect local minima in a 1D signal using scipy." But do not let the AI write your report. The report must reflect your understanding. Think of AI as a senior colleague who helps debug, not as an author. If you use AI, cite it (e.g., "I used Gemini to generate the initial skeleton of the data loader").

Deliverables: Code and Report

Your submission includes a PDF project report with public links to two Colab notebooks: one for training, one for inference. The training notebook must run end-to-end ("Run all") and save the model to Hugging Face. The inference notebook loads the model and evaluates on test data (which TAs will replace). Ensure your code is clean and commented. Give your report a clear, semantic structure: headings, lists, and code blocks.

Conclusion

Counting pushups is a microcosm of video understanding challenges. By combining pose estimation with temporal logic, training a 3D CNN, or building a Transformer over frame features, you'll gain hands-on experience with spatiotemporal modeling. Remember, the goal is to learn, not to achieve perfect accuracy. Document your failures as well as successes. Good luck, and may your pushup count be accurate!