Part 4: Monte Carlo Methods

So far, we’ve seen how to evaluate and improve policies with a known model (DP) and how to handle simple bandits. Now we tackle model-free prediction and control using Monte Carlo (MC) methods. Monte Carlo methods learn from sample episodes — they require no knowledge of environment probabilities, only the ability to sample episodes by interacting with the environment.

Monte Carlo Prediction: Estimate the value function \(v_{\pi}(s)\) by averaging the returns observed after visits to state \(s\) under policy \(\pi\).
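In symbols, this is just a sample average: if \(N(s)\) counts the (first) visits to \(s\) across episodes and \(G_i(s)\) is the return that followed the \(i\)-th such visit, then

\[ V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s) \;\to\; v_{\pi}(s) \quad \text{as } N(s) \to \infty. \]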

First-Visit vs Every-Visit MC: Two ways to handle multiple visits to a state in an episode:

  • First-Visit MC: Only use the first time a state is visited in each episode to update its value
  • Every-Visit MC: Use every occurrence of the state in the episode to update the average (still converges to the true value, but the observations are not independent)

Monte Carlo Control: Improve the policy using MC estimates of action values. Requires enough exploration, through either:

  • Exploring Starts (eventually start episodes from every state-action pair), or
  • \(\epsilon\)-soft policies (policies that always have a non-zero probability of selecting every action)

We will illustrate MC prediction and control on the Gambler’s Problem (from section 2) for practice, and also demonstrate the concepts in the Grid World.

Monte Carlo Prediction (Policy Evaluation)

Let’s return to the gambler environment (small version: goal=10, p=0.5) and evaluate the policy “always bet 1” via Monte Carlo simulation, instead of using the known probabilities. We will implement first-visit MC prediction:

  1. Simulate many episodes using the policy.
  2. For each episode, record the return \(G_t\) from each state visited.
  3. Update the value estimate of each state at its first visit by averaging the observed returns.

Because the reward is 0 on every intermediate step and 1 only when the gambler reaches the goal, the return is simply 1 if the episode ends in success and 0 if it ends in failure. So the value of a state is the probability of eventually reaching the goal from that state (as we computed earlier).
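In symbols, with an undiscounted return (\(\gamma = 1\), as in the code below):

\[ G_t = \sum_{k=t+1}^{T} R_k = \mathbf{1}\{S_T = \text{goal}\}, \qquad v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s] = \Pr_{\pi}(\text{reach goal} \mid S_t = s). \]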

import random

# Gambler environment from state with always-bet-1 policy
def generate_gambler_episode(start_capital, goal, p_head=0.5):
    """
    Simulates one complete episode of the gambler's problem with a fixed bet-1 policy
    Parameters:
    - start_capital: Initial money the gambler has
    - goal: Target amount to win
    - p_head: Probability of winning each bet (default 0.5 for fair coin)
    """
    episode = []
    capital = start_capital
    while capital not in (0, goal):
        action = 1  # bet 1
        next_capital = capital + 1 if random.random() < p_head else capital - 1
        reward = 0  # no intermediate reward
        episode.append((capital, action, reward))
        capital = next_capital
    # Append terminal state transition
    final_reward = 1 if capital == goal else 0
    episode.append((capital, None, final_reward))  # terminal state
    return episode

The final state is handled specially with (capital, None, final_reward), where:

  • the action is None because it’s a terminal state
  • final_reward is 1 if we reached the goal, 0 if we went bust

Here’s a simple example to illustrate:

# Example episode with start_capital=2, goal=4, p_head=0.5 might look like:
[(2, 1, 0),    # Start with $2, bet $1, no reward yet
 (3, 1, 0),    # Won first bet (now $3), bet $1, no reward yet
 (2, 1, 0),    # Lost second bet (back to $2), bet $1, no reward yet
 (3, 1, 0),    # Won third bet (now $3), bet $1, no reward yet
 (4, None, 1)] # Won fourth bet, reached goal, get reward of 1

This function is crucial for Monte Carlo methods because it generates the experience samples we’ll learn from. Complete episodes allow us to calculate actual returns (G) rather than estimated values. The structure (s, a, r) matches the standard format for reinforcement learning algorithms.

# First-visit MC policy evaluation
def mc_first_visit_value(start_states, goal, p_head, policy_func, episodes=10000):
    """First-visit MC prediction for the gambler's problem: estimate V(s) by
    averaging returns over the first visit to each state in every episode."""
    values = {s: 0.0 for s in range(goal)}
    returns_sum = {s: 0.0 for s in range(goal)}
    returns_count = {s: 0 for s in range(goal)}

    for _ in range(episodes):
        # We can start episodes from a random state (to ensure exploring starts)
        s0 = random.choice(start_states)
        episode = []
        capital = s0
        # Generate episode following policy
        while capital not in (0, goal):
            bet = policy_func(capital)
            next_capital = capital + bet if random.random() < p_head else capital - bet
            reward = 0
            episode.append((capital, reward))
            capital = next_capital
        # Terminal state reached
        final_reward = 1 if capital == goal else 0
        episode.append((capital, final_reward))
        # Calculate returns backward
        G = 0
        visited = set()
        for state, reward in reversed(episode):
            G = reward + G  # undiscounted return: the terminal reward (0 or 1) propagates back
            if state not in visited and state not in (0, goal):
                visited.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
                values[state] = returns_sum[state] / returns_count[state]
    return values

# Use MC to evaluate policy (always bet 1) for gambler states 1..9
policy_bet1_func = lambda s: 1
mc_values = mc_first_visit_value(start_states=range(1,10), goal=10, p_head=0.5, policy_func=policy_bet1_func, episodes=5000)
print("MC estimated values for gambler states 1..9 (bet-1 policy):")
for s in range(1, 10):
    print(f"V({s}) ≈ {mc_values[s]:.2f}")
MC estimated values for gambler states 1..9 (bet-1 policy):
V(1) ≈ 0.10
V(2) ≈ 0.19
V(3) ≈ 0.29
V(4) ≈ 0.39
V(5) ≈ 0.50
V(6) ≈ 0.60
V(7) ≈ 0.70
V(8) ≈ 0.81
V(9) ≈ 0.91

The values increase roughly linearly from state 1 to 9 (similar to the earlier simulation); for a fair coin the exact success probability from capital \(s\) with goal 10 is \(s/10\), so the estimates above are close. More episodes yield a closer approximation.
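As a quick sanity check, we can compare the MC estimates against those exact values (a small sketch that reuses the mc_values dictionary computed above):

# Compare the MC estimates with the exact fair-coin values V(s) = s / goal
exact_values = {s: s / 10 for s in range(1, 10)}
max_err = max(abs(mc_values[s] - exact_values[s]) for s in range(1, 10))
print(f"Largest absolute error of the MC estimates: {max_err:.3f}")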

Note: We used exploring starts by starting episodes from random states to ensure each state gets visited enough. Otherwise, if we only started at state 5 every time, states like 1 or 9 might rarely be visited under the policy.

The above code used first-visit MC. To switch to every-visit MC, we would remove the visited check and update every time a state appears, as sketched below. First-visit MC has the advantage that each state’s returns are i.i.d. samples of \(v_{\pi}(s)\).
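A minimal sketch of the every-visit update for a single episode (the helper name every_visit_update is illustrative, not part of the code above; it expects the same episode format and bookkeeping dictionaries as mc_first_visit_value):

def every_visit_update(episode, values, returns_sum, returns_count, goal):
    """Every-visit MC update: same backward pass as above, but without the
    visited set, so every occurrence of a state contributes a return."""
    G = 0
    for state, reward in reversed(episode):
        G = reward + G
        if state not in (0, goal):
            returns_sum[state] += G
            returns_count[state] += 1
            values[state] = returns_sum[state] / returns_count[state]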

Monte Carlo Control (Finding Optimal Policy)

Now let’s try MC control to improve the policy. We’ll use the Exploring Starts approach first:

  • We will maintain an estimate of \(Q(s,a)\) for each state-action pair.
  • We will generate episodes with exploring starts (start at a random state with a random action to ensure all pairs are seen).
  • After each episode, we will update \(Q(s,a)\) for the visited state-action pairs toward the observed returns.
  • Then we improve the policy greedily with respect to \(Q\) (making it \(\epsilon\)-soft if needed to continue exploration).

For the gambler’s problem:

  • State = capital (1 to 9 for goal=10).
  • Actions = bet amount, from 1 to min(s, goal-s): you can’t bet more money than you have (s), and you don’t want to bet more than needed to reach the goal (goal-s).

We want to maximize the probability of reaching the goal.

Let’s implement MC Control with exploring starts for the gambler:

# Monte Carlo Control with Exploring Starts for gambler
def mc_control_gambler(goal=10, p_head=0.3, episodes=10000):
    # Initialize Q and returns tracking
    states = range(1, goal)  # 1..goal-1
    # actions for each state: 1..min(s, goal-s)
    actions = {s: list(range(1, min(s, goal-s) + 1)) for s in states}
    Q = { (s,a): 0.0 for s in states for a in actions[s] }
    returns_sum = { (s,a): 0.0 for s in states for a in actions[s] }
    returns_count = { (s,a): 0 for s in states for a in actions[s] }
    # Start with arbitrary policy (say always bet 1 initially)
    policy = { s: 1 for s in states }
    
    for episode_num in range(episodes):
        # Exploring start: pick random state and random action (not terminal)
        s0 = random.choice(list(states))
        a0 = random.choice(actions[s0])
        # Generate episode starting from (s0, a0)
        episode = []
        capital = s0
        action = a0
        # Apply the first action from exploring start
        next_capital = capital + action if random.random() < p_head else capital - action
        reward = 0
        episode.append((capital, action, reward))
        capital = next_capital
        # Follow current policy thereafter
        while capital not in (0, goal):
            a = policy[capital]
            next_capital = capital + a if random.random() < p_head else capital - a
            reward = 0
            episode.append((capital, a, reward))
            capital = next_capital
        # Episode ended
        final_reward = 1 if capital == goal else 0
        episode.append((capital, None, final_reward))
        
        # Calculate returns G and update first-visit
        G = 0
        visited_sa = set()
        for state, action, reward in reversed(episode):
            G = reward + G
            if action is not None:  # skip terminal state in episode
                sa = (state, action)
                if sa not in visited_sa:
                    visited_sa.add(sa)
                    returns_sum[sa] += G
                    returns_count[sa] += 1
                    Q[sa] = returns_sum[sa] / returns_count[sa]
                    # Improve policy for that state
                    # Choose action with max Q
                    best_a = max(actions[state], key=lambda a: Q[(state,a)])
                    policy[state] = best_a
    return policy, Q

opt_policy, opt_Q = mc_control_gambler(goal=10, p_head=0.4, episodes=50000)
print("Learned optimal policy (bet amounts) for states 1..9:")
for s in range(1, 10):
    print(f"State {s}: bet {opt_policy[s]}")
Learned optimal policy (bet amounts) for states 1..9:
State 1: bet 1
State 2: bet 2
State 3: bet 3
State 4: bet 4
State 5: bet 5
State 6: bet 4
State 7: bet 3
State 8: bet 2
State 9: bet 1

Understanding Optimal Betting Strategy in the Gambler’s Problem

When p < 0.5 (unfair coin), the optimal betting strategy might seem counterintuitive at first. Let’s understand why making larger bets is actually optimal in this scenario.

Consider a gambler who needs to win $3 with p = 0.4 (40% chance of winning each bet). They have two possible strategies:

  1. Conservative Strategy: Bet $1 three times
    • Needs to win all three bets in a row to gain $3 in three bets
    • Probability of three straight wins = 0.4 * 0.4 * 0.4 = 0.064 (6.4%); the true success probability of the always-bet-1 policy is higher, since a lost bet can be recovered, but it still stays well below 40% (see the check after this list)
  2. Aggressive Strategy: Bet $3 once
    • Needs to win just one bet
    • Probability = 0.4 (40% chance of success)
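To make that concrete, here is a quick numerical check using the classic gambler’s-ruin formula (a sketch assuming the gambler starts with $3 and the goal is $6, so both strategies put the same $3 at risk; the helper name is illustrative):

def ruin_success_prob(s, goal, p):
    """Exact probability of reaching `goal` from capital `s` under the
    always-bet-1 policy (gambler's-ruin formula, valid for p != 0.5)."""
    r = (1 - p) / p
    return (1 - r**s) / (1 - r**goal)

print(f"Always bet $1: {ruin_success_prob(3, 6, 0.4):.3f}")  # ≈ 0.229
print(f"Bet $3 once:   {0.4:.3f}")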

The aggressive strategy is clearly superior here. This illustrates a fundamental principle: when the odds are against you (p < 0.5), making multiple small bets is worse than making fewer larger bets. This is because:

  • Each bet has an expected negative return
  • More bets means more opportunities for the negative expected value to manifest
  • The law of large numbers works against you, pushing your results toward the negative expected value
  • By making fewer, larger bets, you minimize exposure to the unfavorable odds

This is why our Monte Carlo control algorithm, starting from a conservative “bet 1” policy, will eventually learn to make larger bets when p < 0.5: the return samples for larger bets produce higher Q-value estimates, and greedy policy improvement follows them.
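One way to see this on the run above is to inspect the learned action values at a particular state, say capital 5 (exact numbers vary from run to run, but the bet-everything action tends to come out on top):

# Inspect learned Q-values at state 5 (legal bets are 1..5 when goal=10)
for a in range(1, 6):
    print(f"Q(5, bet {a}) ≈ {opt_Q[(5, a)]:.3f}")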

Note: Monte Carlo control might need many episodes to converge on the optimal policy, especially for nontrivial p. We used exploring starts to ensure all actions are tried. Without exploring starts, we could use ε-greedy policy improvement (on-policy MC control), where the policy keeps some probability of trying non-greedy actions.
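A minimal sketch of what that ε-greedy action selection could look like for the gambler (the function name and epsilon value are illustrative, not part of the code above):

def epsilon_greedy_bet(capital, Q, actions, epsilon=0.1):
    """With probability epsilon pick a random legal bet (exploration);
    otherwise pick the bet with the highest current Q estimate."""
    if random.random() < epsilon:
        return random.choice(actions[capital])
    return max(actions[capital], key=lambda a: Q[(capital, a)])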

Exercises

Exercise 1: Implement Monte Carlo prediction and control for the Grid World environment. Compare the learned value functions and policies with those obtained from Dynamic Programming.

Exercise 2: Apply Monte Carlo methods to learn optimal play in Blackjack, following the example in Sutton & Barto. Analyze how the learned strategy compares to basic strategy tables used by players.

Exercise 3: Implement on-policy Monte Carlo control using an \(\epsilon\)-soft policy instead of exploring starts:

  1. Initialize a policy \(\pi\) that is \(\epsilon\)-greedy with respect to \(Q\), ensuring \(P(a|s) > 0\) for all state-action pairs.

  2. Generate complete episodes by following policy \(\pi\).

  3. For each state-action pair \((s,a)\) visited in the episode, update the action-value function \(Q(s,a)\) using the observed returns.

  4. Improve policy \(\pi\) by making it more greedy with respect to \(Q\) while maintaining \(\epsilon\)-softness (i.e., ensure all actions still have some non-zero probability of being selected). Compare the learning performance with the exploring starts approach.

Summary

Monte Carlo methods allow us to learn value functions and policies from experience:

  • We estimated values by averaging returns (the essence of Monte Carlo prediction).
  • We saw first-visit vs every-visit approaches, which both converge to the true values.
  • For control, we ensured sufficient exploration either via exploring starts or by maintaining an ε-soft policy, then improved the policy by comparing action returns.
  • MC methods wait until an episode ends to update, which can be a downside for long episodes or continuing tasks.

Next, we’ll explore Temporal-Difference (TD) Learning, which learns from incomplete episodes and combines ideas of Monte Carlo and DP.

