Assignment Chef icon Assignment Chef
All English tutorials

Programming lesson

Mastering MDPs for CS7641 Assignment 4: Value Iteration, Policy Iteration, and RL in Practice

A comprehensive guide to Markov Decision Processes for CS7641 Assignment 4, covering small and large MDP design, value iteration vs policy iteration convergence, and reinforcement learning algorithms with timely examples from gaming and AI.

CS7641 assignment 4 Markov Decision Process tutorial value iteration vs policy iteration MDP convergence analysis reinforcement learning algorithms small vs large MDP exploration strategies epsilon-greedy UCB RL model-free vs model-based MDP design examples CS7641 analysis writeup policy iteration convergence speed value iteration iterations Q-learning for MDP sequential decision making AI

Introduction to Markov Decision Processes in CS7641

Markov Decision Processes (MDPs) form the backbone of sequential decision-making in reinforcement learning. In this tutorial, we explore how to approach the CS7641 Assignment 4, which asks you to design two MDPs—one small and one large—and solve them using value iteration, policy iteration, and a reinforcement learning algorithm of your choice. By the end, you'll understand convergence behavior, state-space effects, and the trade-offs between model-based and model-free methods.

Why MDPs Matter: From Gaming to AI Assistants

Think of an MDP as a game where an agent makes decisions to maximize rewards over time. This mirrors how AI in popular games like League of Legends or Dota 2 learns to navigate complex environments. Similarly, recommendation systems (e.g., TikTok's feed) use MDP-like models to keep users engaged. By mastering MDPs, you're not just solving an assignment—you're learning skills used in cutting-edge AI applications.

Designing Interesting MDPs for the Assignment

Your assignment requires two MDPs: one with a small number of states (e.g., under 50) and one with a large number (e.g., 500+). Avoid grid worlds—instead, think of processes you care about.

Small MDP Example: Inventory Management for a Viral Product

Imagine you're managing stock for a limited-edition sneaker drop (like the latest Nike collaboration). States represent inventory levels (0 to 10 pairs). Actions are ordering 0, 1, or 2 pairs. Rewards come from sales minus holding costs. This MDP is small (11 states) but interesting because it models real-world supply-demand dynamics.

Large MDP Example: Social Media Engagement Optimization

Consider a content creator on YouTube trying to maximize watch time. States could be combinations of video topic, time of day, and audience mood (10 topics × 24 hours × 5 moods = 1200 states). Actions are choosing a video style (tutorial, vlog, review). Rewards are watch time. This large MDP reflects how algorithms like YouTube's recommendation system work.

Solving MDPs with Value Iteration and Policy Iteration

Both algorithms find the optimal policy. Value iteration updates the value function until convergence; policy iteration alternates between policy evaluation and improvement.

Defining Convergence

You might define convergence as when the maximum change in value across states is less than a threshold (e.g., 1e-6). For the small MDP, value iteration converges in about 50 iterations; policy iteration in 3-5 iterations. For the large MDP, value iteration may take 200+ iterations, while policy iteration still converges in fewer than 10 iterations—but each iteration is more expensive.

Why Does Policy Iteration Converge Faster?

Policy iteration directly improves the policy, leading to fewer overall iterations. However, each policy evaluation requires solving a system of linear equations. In practice, for small MDPs, both are fast; for large MDPs, value iteration may be simpler to implement but slower to converge.

Reinforcement Learning: Model-Free MDP Solving

Now, pick your favorite RL algorithm—Q-learning or SARSA are common choices. Since you don't know the transition probabilities or rewards, the agent must explore.

Exploration Strategies: Epsilon-Greedy vs. UCB

Epsilon-greedy (e.g., epsilon=0.1) is simple: 10% of the time, take a random action. Upper Confidence Bound (UCB) selects actions based on uncertainty, often leading to faster convergence. In our inventory MDP, epsilon-greedy works well; for the social media MDP, UCB might better handle the large state space.

Performance Comparison

With a known model, value and policy iteration find the exact optimal policy. RL algorithms find near-optimal policies after sufficient training—often requiring thousands of episodes for the large MDP. The gap highlights the value of model-based methods when the model is available.

Practical Tips for Your Analysis Writeup

Your analysis (max 8 pages) should include graphs of convergence (e.g., value function change per iteration) and a table comparing iteration counts. Use clear labels and avoid tiny fonts. Discuss why you defined convergence as you did and how state size affected performance.

Connecting to Trends: AI in 2026

In 2026, reinforcement learning powers autonomous vehicles and real-time bidding systems. Your assignment mirrors these applications—small MDPs for simple tasks (like parking a car) and large MDPs for complex ones (like navigating a city). Understanding convergence trade-offs is crucial for deploying RL in production.

Conclusion

By designing two MDPs and solving them with value iteration, policy iteration, and RL, you gain hands-on experience with core concepts of sequential decision-making. Focus on clear analysis and original MDPs—avoid copying grid worlds. Good luck with CS7641 Assignment 4!