Part 3: Dynamic Programming

Dynamic Programming (DP) methods solve reinforcement learning problems when we have a perfect model of the environment (known states, actions, transition probabilities, and rewards). DP is essentially planning: using the Bellman equations to compute value functions and optimal policies. Although DP itself requires a model and is not used directly in model-free learning, it lays the groundwork for understanding value functions and optimality.

Key ideas:

- Policy Evaluation: compute the value function \(v_{\pi}(s)\) for a given policy \(\pi\) (prediction with a model).
- Policy Improvement: given the value function \(v_{\pi}\), improve the policy by acting greedily with respect to those values.
- Policy Iteration: alternate evaluation and improvement until convergence to an optimal policy.
- Value Iteration: a streamlined approach that folds evaluation and improvement into a single update, moving values toward optimality directly.

We’ll demonstrate these in a simple Grid World environment.

Grid World Environment

Consider a 4x4 grid (like a little board game). The agent starts in some cell and can move up, down, left, or right:

- Some cells are terminal states (once reached, the episode ends). For simplicity, let's use the top-left (0,0) and bottom-right (3,3) as terminal states; the goal could be to reach the bottom-right corner.
- Each step gives a reward of -1, so there is a penalty per move, which encourages finding the shortest path to a terminal state.

This setup is similar to the example in Sutton & Barto (Gridworld), where the objective is to reach a goal in as few steps as possible (minimize cost).

Let's define the environment dynamics:

- States: (i,j) grid positions, i,j ∈ {0,1,2,3}. Two terminal states: (0,0) and (3,3).
- Actions: up, down, left, right.
- Transition: moves are deterministic, but an action that would take the agent off-grid leaves it in the same state.
- Reward: -1 for each action, except 0 for the move that enters a terminal state (the episode ends there).
- Discount factor: γ = 1 (an episodic task with an implicitly finite horizon, so we focus on minimizing the number of steps).

We’ll represent the value function as a 4x4 matrix for convenience.

(0,0)  (0,1)  (0,2)  (0,3)
(1,0)  (1,1)  (1,2)  (1,3)
(2,0)  (2,1)  (2,2)  (2,3)
(3,0)  (3,1)  (3,2)  (3,3)

Actions are defined as changes to the (row, col) position: an action is a tuple (delta_row, delta_col), applied as new_row = current_row + delta_row and new_col = current_col + delta_col.

import numpy as np

# Define grid world parameters
rows = cols = 4
terminal_states = [(0,0), (3,3)]
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right as changes in (row, col)

def is_terminal(state):
    return state in terminal_states

# Helper to get next state and reward
def step(state, action):
    if is_terminal(state):
        return state, 0  # no change if terminal
    r, c = state
    dr, dc = action
    new_r, new_c = r + dr, c + dc
    # if out of bounds, stay in same state
    if new_r < 0 or new_r >= rows or new_c < 0 or new_c >= cols:
        new_r, new_c = r, c
    new_state = (new_r, new_c)
    reward = 0 if is_terminal(new_state) else -1
    return new_state, reward

The code above defines the grid dimensions, the two terminal states, the list of actions as (delta_row, delta_col) offsets, and a step(state, action) helper that returns the next state and reward: a move that would leave the grid keeps the agent in place, and the reward is -1 unless the move enters a terminal state, in which case it is 0.
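A few illustrative calls show the dynamics (these particular calls are just examples, not part of the original lab code):

print(step((1, 1), (-1, 0)))  # up from (1,1): ((0, 1), -1)
print(step((0, 1), (0, -1)))  # left from (0,1) enters the terminal (0,0): ((0, 0), 0)
print(step((0, 3), (-1, 0)))  # up from (0,3) bumps into the wall: ((0, 3), -1)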

Now we have our environment model via step(state, action).

Policy Evaluation (Prediction)

Let's start with an arbitrary policy \(\pi\) for the non-terminal states. For example, the uniformly random policy: \(\pi(a|s) = 0.25\) for each of the four moves.

This is just the initial policy - it’s not a good one, but it gives us a starting point for policy iteration algorithms to improve upon.

We want to compute the state-value function \(v_{\pi}(s)\) for all s. This can be done by solving the linear equations directly or by iteratively applying the Bellman expectation backup: \[v_{k+1}(s) = \sum_a \pi(a|s)\sum_{s',r} P(s',r|s,a)\,[r + \gamma v_k(s')]\]

In our deterministic case with the uniform random policy, this simplifies to an average over the four moves: \[v_{k+1}(s) = \frac{1}{4}\sum_{a \in \{\text{up, down, left, right}\}} [\, r(s,a) + v_k(s') \,],\] for non-terminal s, where s' is the state the move leads to and r(s,a) = -1, unless the move enters a terminal state, in which case r(s,a) = 0.

We can initialize \(v_0(s) = 0\) for all s and iterate until values converge (the changes become very small).

# Initialize value function
V = np.zeros((rows, cols))
policy = { (i,j): [0.25, 0.25, 0.25, 0.25] for i in range(rows) for j in range(cols) if not is_terminal((i,j)) }

gamma = 1.0
theta = 1e-4  # convergence threshold
iterations = 0

while True:
    delta = 0
    new_V = V.copy()  # synchronous sweep: update a copy so every state uses values from the previous iteration
    for i in range(rows):
        for j in range(cols):
            s = (i,j)
            if is_terminal(s):
                continue
            v = 0
            for idx, action in enumerate(actions):
                prob = policy[s][idx]
                next_s, reward = step(s, action)
                v += prob * (reward + gamma * V[next_s]) # Bellman equation for policy evaluation
            new_V[s] = v
            delta = max(delta, abs(v - V[s])) 
    V = new_V
    iterations += 1
    if delta < theta:
        break

print(f"Policy evaluation converged in {iterations} iterations.")
print("Value function under the random policy:")
print(V)
Policy evaluation converged in 172 iterations.
Value function under the random policy:
[[  0.         -12.99893866 -18.99842728 -20.99824003]
 [-12.99893866 -16.99861452 -18.9984378  -18.99842728]
 [-18.99842728 -18.9984378  -16.99861452 -12.99893866]
 [-20.99824003 -18.99842728 -12.99893866   0.        ]]

The policy itself is built with a Python dictionary comprehension that maps every non-terminal state to [0.25, 0.25, 0.25, 0.25], the probability distribution over the four possible actions (up, down, left, right). Running the code produces a 4x4 grid of values. We expect:

- Terminal states remain 0.
- All other values are negative (because of the -1 cost per step).
- States closer to a terminal (especially \((3,2)\), next to \((3,3)\)) have higher (less negative) values, because they end the episode sooner and stop accumulating penalties.

The Bellman update is the crucial part: \[v_{k+1}(s) = \sum_a \pi(a|s)\sum_{s',r} P(s',r|s,a)\,[r + \gamma v_k(s')]\]

This is the Bellman equation for policy evaluation. It says that the new value of a state is the expected immediate reward plus the discounted value of the next state, averaged over the actions the policy might choose and the transitions they cause.
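For concreteness, here is the very first synchronous backup for state (0,1), starting from \(v_0(s) = 0\) everywhere: up bounces off the wall (reward -1), left enters the terminal (0,0) (reward 0), and down and right move to non-terminal neighbours (reward -1 each), so \[v_1((0,1)) = \tfrac{1}{4}\big[(-1 + 0) + (0 + 0) + (-1 + 0) + (-1 + 0)\big] = -0.75.\]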

The line delta = max(delta, abs(v - V[s])) tracks the maximum change in any state's value during the sweep; once it falls below theta, the values are considered converged.

Suggested Exercise: Try modifying the code to:

- Print the value function after each iteration to see how it evolves.
- Use different initial values for V instead of zeros.
- Experiment with different gamma values to see how they affect the final values (see the sketch after this list).
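As a starting point for the last item, here is one possible sketch. The helper evaluate_random_policy is a hypothetical name of ours, not part of the lab code; it simply wraps the evaluation loop above with gamma as a parameter:

# Sketch for the gamma exercise (evaluate_random_policy is a hypothetical helper):
# rerun policy evaluation under the uniform random policy for several discount
# factors and compare the resulting value tables.
def evaluate_random_policy(gamma, theta=1e-4):
    V = np.zeros((rows, cols))
    while True:
        delta = 0
        new_V = V.copy()
        for i in range(rows):
            for j in range(cols):
                s = (i, j)
                if is_terminal(s):
                    continue
                v = sum(0.25 * (r + gamma * V[ns])
                        for ns, r in (step(s, a) for a in actions))
                new_V[s] = v
                delta = max(delta, abs(v - V[s]))
        V = new_V
        if delta < theta:
            return V

for g in (0.5, 0.9, 1.0):
    print(f"gamma = {g}:")
    print(np.round(evaluate_random_policy(g), 2))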

Policy Improvement

Now that we have \(v_{\pi}\), we can improve the policy. Policy improvement means: for each state, look at the action that would yield the highest value (one-step lookahead using \(v_{\pi}\)): \[q_{\pi}(s,a) = \sum_{s',r} P(s',r|s,a)\,[r + \gamma v_{\pi}(s')],\] and pick a greedy action \(a^* = \arg\max_a q_{\pi}(s,a)\).

If \(a^*\) is better than the current policy’s action, we change the policy at s to choose \(a^*\) (deterministically).

Let’s perform one policy improvement step on the uniform policy’s value function.

# Policy Improvement
improved_policy = {} # This dictionary will store the improved policy

for i in range(rows): 
    for j in range(cols): 
        s = (i,j) 
        if is_terminal(s): 
            continue
        # Calculate Q-value for each possible action
        q_values = [] # Will store Q(s,a) for each action
        for action in actions: 
            next_s, reward = step(s, action) # Get the next state and reward
            # Q(s,a) = R + γV(s')
            q = reward + gamma * V[next_s] # Using our previously computed V table
            q_values.append(q) 
        # Find the best action(s)
        max_q = max(q_values) # Get the maximum q-value
        best_actions = [idx for idx, q in enumerate(q_values) if abs(q - max_q) < 1e-8] # Get the indices of the best actions (if there are multiple best actions) 
        # For deterministic policy, pick one best action
        best_action = best_actions[0] # Pick the first best action
        # (Alternatively, could make the policy probabilistic over best_actions)
        action_map = {0:"^", 1:"v", 2:"<", 3:">"}
        improved_policy[s] = action_map[best_action] # Store the improved policy
        
# Display the improved policy in grid form 
print("Improved policy (arrows indicate action):")
policy_grid = [[None]*cols for _ in range(rows)] # This list will store the improved policy in grid form
for i in range(rows):
    for j in range(cols):
        if is_terminal((i,j)):
            policy_grid[i][j] = "T"  # terminal
        else:
            policy_grid[i][j] = improved_policy[(i,j)]
for row in policy_grid:
    print(" ".join(row))
Improved policy (arrows indicate action):
T < < v
^ ^ v v
^ ^ v v
^ > > T

The policy above is displayed using arrows (^ for up, v for down, < for left, > for right). Notice how it intuitively guides the agent toward terminal states - from each cell, the arrow typically points toward the nearest goal state at (0,0) or (3,3).

What we’ve just demonstrated is how a value function can be used to derive an improved policy. Even though we started with values from a random policy, this improvement step yielded a better policy by making greedy choices based on those values.

To summarize, we've completed one iteration of Policy Iteration by:

1. Evaluating the initial random policy to get its value function.
2. Improving the policy by selecting greedy actions based on that value function.

This process can be repeated - we could evaluate this new policy and improve it again, continuing until the policy stabilizes. When no further improvements are possible, we’ve found the optimal policy.

Let’s implement the complete Policy Iteration algorithm:

# Full Policy Iteration
policy = { (i,j): [0.25,0.25,0.25,0.25] for i in range(rows) for j in range(cols) if not is_terminal((i,j)) }
V = np.zeros((rows, cols))

stable = False # This flag will track if the policy has converged
iteration = 0 # This counter will track the number of iterations
while not stable:
    iteration += 1
    # Policy Evaluation: sweep in place until the value function stabilizes under the current policy
    while True:
        delta = 0
        for i in range(rows):
            for j in range(cols):
                s = (i,j)
                if is_terminal(s): 
                    continue
                v = V[s]
                new_v = 0
                for idx, action in enumerate(actions):
                    prob = policy[s][idx]
                    next_s, reward = step(s, action)
                    new_v += prob * (reward + gamma * V[next_s])
                V[s] = new_v
                delta = max(delta, abs(v - new_v))
        if delta < 1e-4:
            break

    # Policy Improvement
    stable = True
    for i in range(rows):
        for j in range(cols):
            s = (i,j)
            if is_terminal(s): 
                continue
            # current policy action (argmax over the action probabilities; ties resolve to the lowest index)
            current_action = np.argmax(policy[s])
            # find best action via one-step lookahead
            q_values = []
            for action in actions:
                next_s, reward = step(s, action)
                q_values.append(reward + gamma * V[next_s])
            best_action = np.argmax(q_values)
            # If policy action is not best action, update policy
            if best_action != current_action: 
                stable = False
            # make new policy greedy (deterministic)
            policy[s] = [0.0]*4
            policy[s][best_action] = 1.0
    if stable:
        print(f"Policy iteration converged in {iteration} iterations.")
        break

# Final optimal value function and policy
print("Optimal value function:")
print(V)
print("Optimal policy:")
for i in range(rows):
    row_policy = []
    for j in range(cols):
        if is_terminal((i,j)):
            row_policy.append("T")
        else:
            act = np.argmax(policy[(i,j)])
            arrow = {0:"^", 1:"v", 2:"<", 3:">"}[act]
            row_policy.append(arrow)
    print(" ".join(row_policy))
Policy iteration converged in 3 iterations.
Optimal value function:
[[ 0.  0. -1. -2.]
 [ 0. -1. -2. -1.]
 [-1. -2. -1.  0.]
 [-2. -1.  0.  0.]]
Optimal policy:
T < < v
^ ^ ^ v
^ ^ v v
^ > > T

Let's break down the policy iteration algorithm. There are two main loops:

1. Outer loop: policy iteration, which continues until the policy is stable.
2. Inner loop: policy evaluation, which iterates until the value function converges.

Inner convergence (policy evaluation): values stabilize when the maximum change in any state's value (delta) falls below the threshold (1e-4).

Outer convergence (policy iteration): the policy has converged when no action changes during an evaluation-improvement cycle.

This should converge in a few iterations. The optimal policy directs the agent to the nearest terminal in the minimum number of steps (essentially a shortest path to either (0,0) or (3,3)). The optimal value function reflects the distance to a terminal: under the reward convention used here (the move that enters a terminal earns 0), the value of a cell equals \(-(d-1)\), where d is the Manhattan distance to the closest terminal; with a -1 reward on every transition it would be \(-d\).
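As a quick check of that relationship, here is a small sketch using the variables defined above (not part of the original lab code):

# Sanity check (sketch): each non-terminal cell's optimal value should equal
# -(d - 1), where d is the Manhattan distance to the nearest terminal,
# because the final move into the terminal earns reward 0.
for i in range(rows):
    for j in range(cols):
        s = (i, j)
        if is_terminal(s):
            continue
        d = min(abs(i - ti) + abs(j - tj) for ti, tj in terminal_states)
        assert abs(V[s] - (-(d - 1))) < 1e-3, f"unexpected value at {s}"
print("All non-terminal values match -(Manhattan distance - 1).")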

IMPORTANT: A common source of bugs is a mismatch between the ordering of the actions list and the ordering of the probabilities stored for each state. In the code above, policy[s] is a plain list whose index i must correspond to actions[i]; if the policy were stored with a different action ordering, np.argmax(policy[s]) would point at the wrong action. Keeping one explicit action ordering everywhere avoids this.

Here’s a better way to structure this using an Enum or constants to ensure consistency:

from enum import Enum

class Action(Enum):
    UP = 0      # matches actions[0] == (-1, 0)
    DOWN = 1    # matches actions[1] == (1, 0)
    LEFT = 2    # matches actions[2] == (0, -1)
    RIGHT = 3   # matches actions[3] == (0, 1)

However, object oriented programming is outside the scope of this tutorial, so we won’t go into more detail here.

Value Iteration

We demonstrated policy iteration above. Value Iteration is another DP approach that skips having a separate policy evaluation step. It directly updates the value function with the best possible action at each step: \[V_{k+1}(s) = \max_a \sum_{s',r} P(s',r|s,a)\,[r + \gamma V_k(s')]\] This is essentially applying the Bellman optimality update repeatedly.

Let’s do value iteration for the same grid:

# Value Iteration
V_vi = np.zeros((rows, cols))  # Start with all values at 0
gamma = 1.0
theta = 1e-4
iters = 0
while True:
    iters += 1
    delta = 0
    for i in range(rows):
        for j in range(cols):
            s = (i,j)
            if is_terminal(s):
                continue
            v = V_vi[s]  # Store current value for comparison later
            # Bellman optimality backup
            best_q = -float('inf') # Initialize worst possible value
            # Compute the best possible action for this state
            for action in actions:
                next_s, reward = step(s, action)
                q = reward + gamma * V_vi[next_s] # Bellman equation
                if q > best_q:
                    best_q = q
            V_vi[s] = best_q # Update state value to best found
            delta = max(delta, abs(v - best_q))  # Track biggest change
    if delta < theta:
        break # We're done!

print(f"Value Iteration converged in {iters} sweeps.")
print("Optimal value function (from Value Iteration):")
print(V_vi)
Value Iteration converged in 3 sweeps.
Optimal value function (from Value Iteration):
[[ 0.  0. -1. -2.]
 [ 0. -1. -2. -1.]
 [-1. -2. -1.  0.]
 [-2. -1.  0.  0.]]

This should yield the same optimal value function as policy iteration (or very close). We can extract the policy by taking the argmax actions from this value function:

optimal_policy = []
for i in range(rows):
    row_policy = []
    for j in range(cols):
        s = (i,j)
        if is_terminal(s):
            row_policy.append("T")
        else:
            # choose best action for this state
            best_action = None
            best_q = -float('inf')
            for idx, action in enumerate(actions):
                next_s, reward = step(s, action)
                q = reward + gamma * V_vi[next_s]
                if q > best_q:
                    best_q = q
                    best_action = idx
            arrow = {0:"^", 1:"v", 2:"<", 3:">"}[best_action]
            row_policy.append(arrow)
    optimal_policy.append(" ".join(row_policy))

print("Optimal policy (from Value Iteration):")
for row in optimal_policy:
    print(row)
Optimal policy (from Value Iteration):
T < < v
^ ^ ^ v
^ ^ v v
^ > > T

It should match the earlier optimal policy.

Value iteration tends to be cheaper here because it has no separate policy evaluation phase. Policy iteration has two nested loops: (1) an outer loop for policy updates and (2) an inner loop for policy evaluation, which itself needs multiple sweeps. Value iteration collapses these into a single loop: each sweep pushes the values directly toward optimality, so no intermediate policy ever has to be evaluated to convergence. (Which method is faster overall depends on the problem; in this tiny grid both finish almost instantly.)

We used Dynamic Programming to compute an optimal policy for a known MDP:

- Policy Evaluation: computed value functions for a given policy by iterative updates until convergence.
- Policy Improvement: greedily improved the policy using one-step lookahead on the value function.
- Policy Iteration: repeated evaluation and improvement to reach an optimal policy.
- Value Iteration: directly computed the optimal value function by iterative Bellman optimality updates.

DP methods guarantee finding the optimal policy if the model is known and the state space is not too large. However, they require full knowledge of transition probabilities and are computationally expensive for large problems. In model-free scenarios, we turn to learning methods like Monte Carlo and Temporal-Difference, which we’ll explore next.

