Temporal-Difference (TD) learning is a powerful model-free approach that blends Monte Carlo and Dynamic Programming ideas. Like MC, TD learns from raw experience without a model. Like DP, TD updates estimates based partly on other learned estimates (bootstrapping). This allows TD to learn online after each step, without waiting for the episode to finish.

Key concepts:

- TD(0) Prediction: Also known as one-step TD, this updates the value \(V(s_t)\) toward the observed reward plus the discounted value of the next state: \[V(s_t) \leftarrow V(s_t) + \alpha [R_{t+1} + \gamma V(s_{t+1}) - V(s_t)]\] The update is driven by the TD error \(\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\).
- SARSA (on-policy TD control): Updates the action-value \(Q(s_t, a_t)\) toward \(R_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\), using the action actually taken next (following the current policy).
- Q-Learning (off-policy TD control): Updates \(Q(s_t, a_t)\) toward \(R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')\), using the greedy best action in the next state (the target policy differs from the behavior policy). Both updates are written out in full just after this list.
- On-policy vs Off-policy: SARSA is on-policy (it learns the value of the policy it follows, including its exploration), whereas Q-learning is off-policy (it learns the value of the optimal policy independently of the agent's behavior).
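Written out in the same form as the TD(0) update above, the two control updates are:

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}\]

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Q-learning)}\]

The only difference is the bootstrap term: SARSA plugs in the action \(a_{t+1}\) that will actually be taken, while Q-learning plugs in the best action available in \(s_{t+1}\).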
Let’s illustrate TD(0) prediction first, then implement SARSA and Q-learning for a control task.
TD(0) Prediction Example
We’ll use the gambler’s problem (or we could use grid world) and see how TD updates converge to true values.
Let’s run TD(0) for the gambler’s “bet 1” policy, as we did with MC, and see whether it approximates the value function.
```python
import random

def td0_prediction(policy_func, alpha=0.1, episodes=1000, start_capital=5, goal=10, p_head=0.5):
    # Initialize value estimates. All intermediate rewards are 0, so instead of
    # paying out a reward of 1 on the final transition we fix V(goal)=1 and
    # V(0)=0; the terminal values are never updated, which gives the same TD
    # target (gamma is implicitly 1, since the task is episodic and undiscounted).
    V = {s: 0.0 for s in range(goal + 1)}
    V[goal] = 1.0
    V[0] = 0.0

    for _ in range(episodes):
        capital = start_capital          # start each episode at the given capital
        while capital not in (0, goal):
            a = policy_func(capital)     # bet size chosen by the policy
            # Win the bet with probability p_head, lose it otherwise.
            next_capital = capital + a if random.random() < p_head else capital - a
            reward = 0                   # all intermediate rewards are 0
            # TD(0) update toward the one-step target.
            V[capital] += alpha * (reward + V[next_capital] - V[capital])
            capital = next_capital
    return V

td_values = td0_prediction(policy_func=lambda s: 1, alpha=0.1, episodes=10000,
                           start_capital=5, goal=10, p_head=0.5)
print("TD(0) estimated value of state 5 (starting state):", round(td_values[5], 3))
```
TD(0) estimated value of state 5 (starting state): 0.515
Code Explanation
- We start with some capital (default 5).
- Our goal is to reach a certain amount (default 10).
- Each round we can bet any amount up to our current capital; under the “bet 1” policy evaluated here, we always bet 1.
- We have a 50% chance of winning each bet (p_head=0.5). If we win, we gain the amount we bet; if we lose, we lose it. In the full gambler’s problem the aim is to find the optimal betting strategy, but here we are only evaluating the fixed “bet 1” policy.
Note how the terminal reward is handled: rather than issuing a reward of 1 on the final transition, we fix \(V(10) = 1\) (and \(V(0) = 0\)) so the last TD update bootstraps toward it. Essentially, TD(0) updates on each step toward the value of the next state, and over many episodes it should converge to the true \(v_{\pi}(5) \approx 0.5\). You can try different starting states, or randomize the start state across episodes, to cover more states.
Think of it like a gambler trying to turn $5 into $10 by making a series of bets. The value function \(V(s)\) represents the probability of winning (reaching $10) starting from state \(s\) under the current policy.
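Following the suggestion above, a quick sweep over start states both covers more of the state space and lets us check that interpretation against the known gambler’s-ruin result: with a fair coin and the “bet 1” policy, the probability of reaching 10 from capital \(s\) is \(s/10\). Here is a small sketch reusing td0_prediction (the step size and episode count are just illustrative choices):

```python
# Estimate V(s) for each possible starting capital and compare with the
# analytic gambler's-ruin probability s/10 (fair coin, always betting 1).
for s in range(1, 10):
    V = td0_prediction(policy_func=lambda c: 1, alpha=0.05, episodes=20000, start_capital=s)
    print(f"start={s}: TD estimate {V[s]:.3f}  vs  true {s/10:.1f}")
```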
TD has the advantage of updating during the episode, so it can learn from incomplete episodes (useful in continuing tasks, since bootstrapping does not require waiting for termination).
Now, let’s move to control with SARSA and Q-learning, which are more commonly demonstrated on tasks like Grid World or Cliff Walking.
SARSA vs Q-Learning on Grid World
We’ll use the grid world, but introduce a slight twist: let’s add a “bad state” to illustrate the difference between on-policy and off-policy. A classic example is the Cliff Walking environment (Sutton & Barto, Example 6.6), where the shortest path to the goal runs along a cliff edge (falling off incurs a big negative reward), while a longer route is safer.
For simplicity, let’s create a small grid where:
- We start at (0,0) and the goal is at (0,3).
- There is a “cliff”-like penalty state, at (1,1) for example, that gives a large negative reward if stepped into.
- An on-policy method will tend to learn a policy that steers clear of the penalty state, while an off-policy method will learn the shortest route even if it passes right next to it.
SARSA and Q-learning behave differently here:
- SARSA updates using the action it actually takes next (which, under \(\epsilon\)-greedy, might be exploratory), so it ends up learning the value of the exploratory policy it follows.
- Q-learning updates as if it always takes the best next action (even when it didn’t), so it learns the values of the greedy target policy and converges toward the optimal action values.
Let’s implement both algorithms on this simple grid with a penalty state:
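Here is a minimal sketch of both algorithms on this grid. The 2x4 grid size, the -1 step reward, the -100 penalty with a reset to the start, and the hyperparameters are all illustrative choices; the variable names (Q, actions, next_action, td_target, gamma) match the discussion that follows.

```python
import random

# Environment: a small 2x4 grid, start at (0,0), goal at (0,3),
# and a penalty state at (1,1) that plays the role of the cliff.
ROWS, COLS = 2, 4
START, GOAL = (0, 0), (0, 3)
PENALTY = (1, 1)
actions = ['up', 'down', 'left', 'right']
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):
    """One environment transition; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    next_state = (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))
    if next_state == GOAL:
        return next_state, 0.0, True      # episode ends at the goal
    if next_state == PENALTY:
        return START, -100.0, False       # big penalty, sent back to start (cliff-style)
    return next_state, -1.0, False        # small step cost encourages short paths

def epsilon_greedy(Q, state, epsilon):
    """Behavior policy shared by both algorithms."""
    if random.random() < epsilon:
        return random.choice(actions)
    q_values = [Q.get((state, a), 0.0) for a in actions]
    return actions[q_values.index(max(q_values))]

def sarsa(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {}
    for _ in range(episodes):
        state, done = START, False
        action = epsilon_greedy(Q, state, epsilon)
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state, epsilon)   # action actually taken next
            td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
            Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (td_target - Q.get((state, action), 0.0))
            state, action = next_state, next_action
    return Q

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon)
            next_state, reward, done = step(state, action)
            next_max = max([Q.get((next_state, a), 0.0) for a in actions])
            td_target = reward + gamma * next_max
            Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (td_target - Q.get((state, action), 0.0))
            state = next_state
    return Q

sarsa_Q = sarsa()
qlearn_Q = q_learning()
print("SARSA      Q(start, 'right'):", round(sarsa_Q.get((START, 'right'), 0.0), 2))
print("Q-learning Q(start, 'right'):", round(qlearn_Q.get((START, 'right'), 0.0), 2))
```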
The critical point making this on-policy is that the value of the next state-action pair, \(Q(s', a')\), comes directly from the action \(a'\) actually taken by the current policy, which could include exploration.
Here, the critical off-policy part is the max operator, \(\max_{a'} Q(s', a')\): the update is computed using the best possible next action, regardless of the exploratory action actually chosen by the current policy. Thus, the policy being evaluated (greedy) differs from the behavior policy used to select actions (e.g., \(\epsilon\)-greedy).
In the code above:

SARSA explicitly selects the next action (next_action) with the same \(\epsilon\)-greedy policy it uses for exploration, and then plugs that action’s value into its target; these are the relevant lines from the sketch above:
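```python
next_action = epsilon_greedy(Q, next_state, epsilon)   # sampled from the behavior policy
td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
```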
Q-learning does not use the exploratory next action in its update; instead it takes the maximum over the next state’s Q-values:
```python
next_max = max([Q.get((next_state, a), 0.0) for a in actions])
td_target = reward + gamma * next_max
```
This explicitly demonstrates why SARSA is on-policy (next action is sampled from the policy being improved), while Q-learning is off-policy (next action used in the update is always the greedy, optimal action, not necessarily chosen by the exploratory policy).
To reiterate:
- On-policy methods (SARSA) learn about the policy actually being executed, exploration included. This generally makes learning safer: they don’t propagate unrealistically high values for risky states, because those values won’t materialize under an \(\epsilon\)-greedy behavior policy that occasionally slips into the penalty.
- Off-policy methods (Q-learning) learn the optimal target policy independently of the current behavior. They can converge to the optimal values faster, but one must be careful about convergence conditions (sufficient exploration, etc.).
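To make this contrast concrete with the sketch above, we can compare the action values each method learns for the state directly above the penalty (risky_state is just a label introduced here). SARSA’s estimates should come out lower, since they price in the chance that the \(\epsilon\)-greedy behavior slips into the penalty:

```python
# Action values learned for the state just above the penalty at (1,1).
# Expect SARSA's numbers to be more pessimistic than Q-learning's, because
# SARSA evaluates the epsilon-greedy behavior rather than the greedy target.
risky_state = (0, 1)
for name, Q in [("SARSA", sarsa_Q), ("Q-learning", qlearn_Q)]:
    row = {a: round(Q.get((risky_state, a), 0.0), 1) for a in actions}
    print(f"{name:10s} Q((0,1), .): {row}")
```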