Programming lesson
Mastering Reinforcement Learning: Sarsa, Q-Learning, and Dyna-Q in a Cliff-Walking Environment
Learn how to implement Sarsa, Q-Learning, and Dyna-Q agents in a cliff-walking gridworld. Understand model-free vs model-based RL with practical code examples and visualization tips.
Introduction to Reinforcement Learning in the Cliff-Walking Environment
Reinforcement Learning (RL) is a powerful branch of machine learning where an agent learns to make decisions by interacting with an environment. In this tutorial, we focus on the classic cliff-walking problem—a grid-shaped maze where the agent must find a safe path to the goal while avoiding a cliff. This assignment (AI3603 Homework 3) asks you to implement Sarsa, Q-Learning, and Dyna-Q algorithms. We'll walk through the key concepts, provide code snippets, and offer tips for analysis. By the end, you'll understand the differences between model-free and model-based RL and be ready to ace your homework.
Understanding the Cliff-Walking Environment
The environment is a 12×4 grid. The start is at the bottom-left corner, and the goal (exit) is at the bottom-right. The agent can move up, down, left, or right. Stepping into the cliff (the bottom row excluding start and goal) yields a reward of -100 and resets the agent to the start. Each step costs -1 (living cost). The episode ends when the agent reaches the goal. The state is an integer representing the (x,y) coordinate, and actions are {0,1,2,3} for four directions.
This setup mimics real-world scenarios like navigating a risky path in a video game or an autonomous robot avoiding hazards. For instance, think of a viral game like Frogger where crossing a road safely is similar to avoiding the cliff.
Implementing Sarsa, Q-Learning, and Dyna-Q
All three algorithms are tabular methods that learn a Q-value function Q(s,a). The key difference lies in how they update Q values and whether they use a model of the environment.
- Sarsa (on-policy): Updates Q(s,a) using the actual next action a' chosen by the current policy (ε-greedy). It's more conservative because it learns the value of the policy it's following.
- Q-Learning (off-policy): Updates Q(s,a) using the maximum Q(s',a') over all actions, independent of the actual next action. This is more aggressive and can learn optimal policy even while exploring.
- Dyna-Q (model-based): In addition to Q-learning updates, it builds a model of the environment (transition and reward) from experience and uses planning steps to update Q values from simulated experiences. This can speed up learning.
In your agent.py, you need to implement ε-greedy action selection with ε decay. For example, start ε at 1.0 and decay it by a factor each episode until it reaches a minimum like 0.01. Here's a snippet:
def choose_action(self, state):
if np.random.random() < self.epsilon:
return np.random.choice(self.n_actions)
else:
return np.argmax(self.q_table[state])
# After each episode, decay epsilon
self.epsilon = max(self.epsilon * self.epsilon_decay, self.epsilon_min)Training Scripts: cliff_walk_sarsa.py, cliff_walk_qlearning.py, cliff_walk_dyna_q.py
Each script should create the environment, instantiate the agent, and run training loops. For Sarsa, the update rule is:
Q(s,a) += lr * (r + gamma * Q(s',a') - Q(s,a))For Q-Learning:
Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))For Dyna-Q, after each real step, perform n planning steps using randomly sampled state-action pairs from the model. The model stores (s,a) -> (s',r).
Don't forget to set hyperparameters: learning rate (e.g., 0.5), gamma (0.9), epsilon decay schedule. Experiment with different values to see how they affect convergence.
Visualizing Results
Your report should include three plots:
- Episode reward during training: Plot the total reward per episode. A good agent should eventually achieve high rewards (close to -13 for the optimal path).
- Epsilon value over episodes: Show how exploration decreases over time.
- Final path: Visualize the grid and the path taken by the learned policy. The optimal path goes up one row, moves right, then down to the goal, avoiding the cliff.
Use libraries like matplotlib. For the path visualization, you can overlay arrows or a line on a grid image.
Analyzing Training Efficiency: Model-Free vs Model-Based
Compare Sarsa/Q-Learning (model-free) with Dyna-Q (model-based). Model-free algorithms learn directly from experience without building a model. They can be sample-inefficient because they discard past experiences. Dyna-Q, by using a model, can reuse experiences via planning, leading to faster convergence in early stages. However, if the model is inaccurate, it may hurt performance. In the cliff-walking environment, Dyna-Q typically learns the optimal policy in fewer episodes because it can 'imagine' outcomes of actions.
For your analysis, run multiple seeds and plot learning curves. Explain that planning steps (e.g., 5 or 10) allow Dyna-Q to update Q-values many times per real step, accelerating learning. But too many planning steps can cause overfitting to an imperfect model.
Deep Reinforcement Learning: Lunar Lander with DQN
Part 2 of the assignment involves training a DQN agent for the LunarLander-v2 environment. The state is an 8-dimensional vector, and actions are discrete (do nothing, fire left, main, right). You need to read the provided dqn.py code and add comments explaining each part (experience replay, target network, epsilon-greedy, etc.). Then tune hyperparameters: gamma (try 0.99), epsilon decay (slower decay for more exploration), and network architecture (e.g., two hidden layers of 64 and 32).
Consider using multiple consecutive frames (stack 4 frames) to capture velocity information, though the state already includes velocities. You can also try different optimizers like Adam with learning rate 0.00025.
Record a video of the trained agent using gym.wrappers.RecordVideo to show successful landings.
Improving Exploration: Beyond ε-Greedy
The assignment asks you to explore another exploration strategy, such as Upper Confidence Bound (UCB). UCB selects actions based on both their estimated value and an uncertainty bonus: a = argmax_a (Q(s,a) + c * sqrt(log(t) / N(s,a))), where N(s,a) is the visit count. This encourages the agent to try less-visited actions. Pros: more systematic exploration, often faster convergence in bandit problems. Cons: requires maintaining counts, computationally heavier, and can be overly optimistic in non-stationary environments. Summarize this in your report.
Final Tips for Your Homework
- Start early: Debug with a simple grid (e.g., 4×4) before scaling to 12×4.
- Use the provided random agent demo to understand the interaction loop.
- For Dyna-Q, implement a simple model as a dictionary mapping (s,a) to (s',r).
- In your report, include screenshots of learning curves and paths.
- For the DQN part, comment every function in
dqn.pythoroughly.
Reinforcement learning is a hot topic in AI, used in robotics, game playing (e.g., AlphaGo), and even finance. Mastering these foundational algorithms will give you a solid base for advanced topics. Good luck!