Temporal-Difference (TD) learning is a powerful model-free approach that blends Monte Carlo and Dynamic Programming ideas. Like MC, TD learns from raw experience without a model. Like DP, TD updates estimates based partly on other learned estimates (bootstrapping). This allows TD to learn online after each step, without waiting for the episode to finish.

Key concepts:

- TD(0) Prediction: Also known as one-step TD, this updates the value \(V(s_t)\) toward the observed reward plus the discounted value of the next state: \[V(s_t) \leftarrow V(s_t) + \alpha [R_{t+1} + \gamma V(s_{t+1}) - V(s_t)]\] The update is driven by the TD error \(\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\).
- SARSA (on-policy TD control): Updates the action-value \(Q(s_t, a_t)\) toward \(R_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\), using the action actually taken next (following the current policy).
- Q-Learning (off-policy TD control): Updates \(Q(s_t, a_t)\) toward \(R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')\), using the greedy best action in the next state (the target policy differs from the behavior policy). Both updates are written out in full just after this list.
- On-policy vs Off-policy: SARSA is on-policy (it learns the value of the policy it follows, including its exploration), whereas Q-learning is off-policy (it learns the value of the optimal policy independently of the agent's behavior).
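Written out in the same form as the TD(0) update above, the two control updates are:

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}\]

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Q-learning)}\]

The only difference is the bootstrap term: SARSA plugs in the action \(a_{t+1}\) that will actually be taken, while Q-learning plugs in the best action available in \(s_{t+1}\).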
Let’s illustrate TD(0) prediction first, then implement SARSA and Q-learning for a control task.
TD(0) Prediction Example
We’ll use the gambler’s problem (or we could use grid world) and see how TD updates converge to true values.
Let’s run TD(0) for the gambler’s “bet 1” policy, as we did with MC, and see whether it approximates the value function.
```python
import random

def td0_prediction(policy_func, alpha=0.1, episodes=1000, start_capital=5, goal=10, p_head=0.5):
    # Initialize value estimates. All intermediate rewards are 0, so instead of
    # paying out a reward of 1 on the final transition we fix V(goal)=1 and
    # V(0)=0; the terminal values are never updated, which gives the same TD
    # target (gamma is implicitly 1, since the task is episodic and undiscounted).
    V = {s: 0.0 for s in range(goal + 1)}
    V[goal] = 1.0
    V[0] = 0.0

    for _ in range(episodes):
        capital = start_capital          # start each episode at the given capital
        while capital not in (0, goal):
            a = policy_func(capital)     # bet size chosen by the policy
            # Win the bet with probability p_head, lose it otherwise.
            next_capital = capital + a if random.random() < p_head else capital - a
            reward = 0                   # all intermediate rewards are 0
            # TD(0) update toward the one-step target.
            V[capital] += alpha * (reward + V[next_capital] - V[capital])
            capital = next_capital
    return V

td_values = td0_prediction(policy_func=lambda s: 1, alpha=0.1, episodes=10000,
                           start_capital=5, goal=10, p_head=0.5)
print("TD(0) estimated value of state 5 (starting state):", round(td_values[5], 3))
```
TD(0) estimated value of state 5 (starting state): 0.515
Code Explanation
- We start with some capital (default 5).
- Our goal is to reach a certain amount (default 10).
- Each round we can bet any amount up to our current capital; under the “bet 1” policy evaluated here, we always bet 1.
- We have a 50% chance of winning each bet (p_head=0.5). If we win, we gain the amount we bet; if we lose, we lose it. In the full gambler’s problem the aim is to find the optimal betting strategy, but here we are only evaluating the fixed “bet 1” policy.
Note how the terminal reward is handled: rather than issuing a reward of 1 on the final transition, we fix \(V(10) = 1\) (and \(V(0) = 0\)) so the last TD update bootstraps toward it. Essentially, TD(0) updates on each step toward the value of the next state, and over many episodes it should converge to the true \(v_{\pi}(5) \approx 0.5\). You can try different starting states, or randomize the start state across episodes, to cover more states.
Think of it like a gambler trying to turn $5 into $10 by making a series of bets. The value function \(V(s)\) represents the probability of winning (reaching $10) starting from state \(s\) under the current policy.
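Following the suggestion above, a quick sweep over start states both covers more of the state space and lets us check that interpretation against the known gambler’s-ruin result: with a fair coin and the “bet 1” policy, the probability of reaching 10 from capital \(s\) is \(s/10\). Here is a small sketch reusing td0_prediction (the step size and episode count are just illustrative choices):

```python
# Estimate V(s) for each possible starting capital and compare with the
# analytic gambler's-ruin probability s/10 (fair coin, always betting 1).
for s in range(1, 10):
    V = td0_prediction(policy_func=lambda c: 1, alpha=0.05, episodes=20000, start_capital=s)
    print(f"start={s}: TD estimate {V[s]:.3f}  vs  true {s/10:.1f}")
```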
TD has the advantage of updating during the episode, so it can learn from incomplete episodes (useful in continuing tasks, since bootstrapping does not require waiting for termination).
Now, let’s move to control with SARSA and Q-learning, which are more commonly demonstrated on tasks like Grid World or Cliff Walking.
SARSA vs Q-Learning on Grid World
We’ll use the grid world, but introduce a slight twist: let’s add a “bad state” to illustrate the difference between on-policy and off-policy. A classic example is the Cliff Walking environment (Sutton & Barto, Example 6.6), where the shortest path to the goal runs along a cliff edge (falling off incurs a big negative reward), while a longer route is safer.
For simplicity, let’s create a small grid where:
- We start at (0,0) and the goal is at (0,3).
- There is a “cliff”-like penalty state, at (1,1) for example, that gives a large negative reward if stepped into.
- An on-policy method will tend to learn a policy that steers clear of the penalty state, while an off-policy method will learn the shortest route even if it passes right next to it.
SARSA and Q-learning behave differently here:
- SARSA updates using the action it actually takes next (which, under \(\epsilon\)-greedy, might be exploratory), so it ends up learning the value of the exploratory policy it follows.
- Q-learning updates as if it always takes the best next action (even when it didn’t), so it learns the values of the greedy target policy and converges toward the optimal action values.
Let’s implement both algorithms on this simple grid with a penalty state:
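Here is a minimal sketch of both algorithms on this grid. The 2x4 grid size, the -1 step reward, the -100 penalty with a reset to the start, and the hyperparameters are all illustrative choices; the variable names (Q, actions, next_action, td_target, gamma) match the discussion that follows.

```python
import random

# Environment: a small 2x4 grid, start at (0,0), goal at (0,3),
# and a penalty state at (1,1) that plays the role of the cliff.
ROWS, COLS = 2, 4
START, GOAL = (0, 0), (0, 3)
PENALTY = (1, 1)
actions = ['up', 'down', 'left', 'right']
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):
    """One environment transition; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    next_state = (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))
    if next_state == GOAL:
        return next_state, 0.0, True      # episode ends at the goal
    if next_state == PENALTY:
        return START, -100.0, False       # big penalty, sent back to start (cliff-style)
    return next_state, -1.0, False        # small step cost encourages short paths

def epsilon_greedy(Q, state, epsilon):
    """Behavior policy shared by both algorithms."""
    if random.random() < epsilon:
        return random.choice(actions)
    q_values = [Q.get((state, a), 0.0) for a in actions]
    return actions[q_values.index(max(q_values))]

def sarsa(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {}
    for _ in range(episodes):
        state, done = START, False
        action = epsilon_greedy(Q, state, epsilon)
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state, epsilon)   # action actually taken next
            td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
            Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (td_target - Q.get((state, action), 0.0))
            state, action = next_state, next_action
    return Q

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon)
            next_state, reward, done = step(state, action)
            next_max = max([Q.get((next_state, a), 0.0) for a in actions])
            td_target = reward + gamma * next_max
            Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (td_target - Q.get((state, action), 0.0))
            state = next_state
    return Q

sarsa_Q = sarsa()
qlearn_Q = q_learning()
print("SARSA      Q(start, 'right'):", round(sarsa_Q.get((START, 'right'), 0.0), 2))
print("Q-learning Q(start, 'right'):", round(qlearn_Q.get((START, 'right'), 0.0), 2))
```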
The critical point making this on-policy is that the value of the next state-action pair, \(Q(s', a')\), comes directly from the action \(a'\) actually taken by the current policy, which could include exploration.
Here, the critical off-policy part is the max operator, \(\max_{a'} Q(s', a')\): the update is computed using the best possible next action, regardless of the exploratory action actually chosen by the current policy. Thus, the policy being evaluated (greedy) differs from the behavior policy used to select actions (e.g., \(\epsilon\)-greedy).
In the code above:

SARSA explicitly selects the next action (next_action) with the same \(\epsilon\)-greedy policy it uses for exploration, and then plugs that action’s value into its target; these are the relevant lines from the sketch above:
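```python
next_action = epsilon_greedy(Q, next_state, epsilon)   # sampled from the behavior policy
td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
```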
Q-learning does not use the exploratory next action in its update; instead it takes the maximum over the next state’s Q-values:
```python
next_max = max([Q.get((next_state, a), 0.0) for a in actions])
td_target = reward + gamma * next_max
```
This explicitly demonstrates why SARSA is on-policy (next action is sampled from the policy being improved), while Q-learning is off-policy (next action used in the update is always the greedy, optimal action, not necessarily chosen by the exploratory policy).
To reiterate:
- On-policy methods (SARSA) learn about the policy actually being executed, exploration included. This generally makes learning safer: they don’t propagate unrealistically high values for risky states, because those values won’t materialize under an \(\epsilon\)-greedy behavior policy that occasionally slips into the penalty.
- Off-policy methods (Q-learning) learn the optimal target policy independently of the current behavior. They can converge to the optimal values faster, but one must be careful about convergence conditions (sufficient exploration, etc.).
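To make this contrast concrete with the sketch above, we can compare the action values each method learns for the state directly above the penalty (risky_state is just a label introduced here). SARSA’s estimates should come out lower, since they price in the chance that the \(\epsilon\)-greedy behavior slips into the penalty:

```python
# Action values learned for the state just above the penalty at (1,1).
# Expect SARSA's numbers to be more pessimistic than Q-learning's, because
# SARSA evaluates the epsilon-greedy behavior rather than the greedy target.
risky_state = (0, 1)
for name, Q in [("SARSA", sarsa_Q), ("Q-learning", qlearn_Q)]:
    row = {a: round(Q.get((risky_state, a), 0.0), 1) for a in actions}
    print(f"{name:10s} Q((0,1), .): {row}")
```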