Now that we’ve covered basic probability, let’s dive into a simple reinforcement learning scenario: a gambling game. This example will introduce core RL ideas of prediction (evaluating how good a given strategy is) and control (finding a better strategy). We will define a simple gambling environment and simulate an agent’s decisions within it.
Problem Setup: A Gambler’s Game
Imagine a gambler who starts with a certain amount of money and aims to reach a target amount. At each turn, the gambler can bet some of their money on a coin flip:

- If the coin comes up Heads (win), the gambler wins the amount bet (it is added to their capital).
- If Tails (lose), the gambler loses the amount bet (it is subtracted from their capital).
- The coin may be biased, with probability \(p\) of Heads.
The game ends either when the gambler reaches the target (goal) or loses all money (bust).
This is known as the Gambler’s Problem from Sutton and Barto’s RL textbook.

- States: The gambler’s capital (from \(0\) to \(Goal\)).
- Actions: The bet amount (from \(1\) up to the smaller of the current capital and the amount still needed to reach the goal).
- Transition: Win or lose the bet based on the coin flip.
- Reward: +1 for reaching the goal, and 0 otherwise (no reward at each step, only at the end). This way, maximizing total reward is equivalent to maximizing the probability of reaching the goal.

A minimal code sketch of these dynamics follows.
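The sketch is only illustrative; the helper names `valid_actions` and `step` are ours (not from the textbook or any library), and the simulations later in this section inline the same logic.

```python
import random

def valid_actions(capital, goal):
    """Legal bets: from 1 up to min(current capital, amount still needed)."""
    return range(1, min(capital, goal - capital) + 1)

def step(capital, bet, goal, p_head=0.5):
    """One coin flip. Returns (next_capital, reward, done).
    Assumes the bet came from valid_actions, so the capital never overshoots the goal."""
    next_capital = capital + bet if random.random() < p_head else capital - bet
    done = (next_capital == 0) or (next_capital == goal)
    reward = 1 if next_capital == goal else 0  # +1 only when the goal is reached
    return next_capital, reward, done

print(list(valid_actions(3, 10)))  # [1, 2, 3]
print(list(valid_actions(8, 10)))  # [1, 2]
```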
For simplicity, let’s implement a smaller version: goal = $10, starting capital = $5, and a fair coin (p = 0.5). The gambler can bet $1 each turn (to keep it simple initially).
Simulation of a Fixed Policy
Policy: always bet $1 until the game ends. This is the simple fixed strategy we will evaluate first.
We will simulate many episodes (games) with this policy to estimate:

- The probability of reaching the goal (this is the value of the starting state under this policy).
- The distribution of episode lengths (how many bets until the game finishes).
Let’s write a simulation for one episode and then loop it for many episodes.
```python
import random

def simulate_gamble(start_capital, goal, p_head=0.5):
    """
    Simulate one episode of the gambling game.
    Returns True if goal reached (win), False if gambler goes broke (lose).
    Policy: always bet 1.
    """
    capital = start_capital
    while 0 < capital < goal:
        # Always bet 1
        bet = 1
        if random.random() < p_head:  # win
            capital += bet
        else:  # lose
            capital -= bet
        # No intermediate rewards in this formulation
    # Loop ends when capital == 0 or capital == goal
    return capital >= goal  # True if reached goal, False if busted

# Test one episode
result = simulate_gamble(5, 10, p_head=0.5)
print("Episode result:", "Reached goal!" if result else "Broke :(")
```
Episode result: Broke :(
Run the above cell a few times to see different outcomes (sometimes the gambler wins, sometimes loses).
Now, let’s estimate the probability of winning from the starting capital of 5 with this policy by simulating many episodes:
```python
trials = 10000
wins = 0
for _ in range(trials):
    if simulate_gamble(5, 10, p_head=0.5):
        wins += 1
win_prob = wins / trials
print(f"Estimated probability of reaching $10 starting from $5 (bet $1 policy) = {win_prob:.3f}")
```
Estimated probability of reaching $10 starting from $5 (bet $1 policy) = 0.498
With a fair coin and symmetric game, you might expect ~50% chance to reach the goal when starting at half the goal (in fact, for an unbiased random walk, the probability of hitting +5 before -5 starting at 0 is 50%). Our simulation likely shows close to 0.5.
This probability is essentially the value of the state “capital=5” under the current policy (always bet 1). In prediction terms, we’ve estimated \(v_{\pi}(5)\) for our policy \(\pi\) (always bet 1) as ~0.5.
We can similarly estimate \(v_{\pi}(s)\) for other starting states s by changing start_capital.
Exercise: Try simulating from start_capital=1 and start_capital=9 to see the win probabilities. (We expect \(v_{\pi}(1)\) to be much lower than \(v_{\pi}(9)\), since with only \(\$1\) the gambler is one loss from ruin, whereas with \(\$9\) they’re one step from the goal.)
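We also said we would look at episode lengths. Here is one way to estimate them, using a small variant of simulate_gamble that additionally counts the bets; the helper name simulate_gamble_with_length is ours, not something defined above. For the fair-coin, bet-$1 case starting at $5, the classic random-walk duration formula \(s(Goal - s)\) suggests an average of about 25 bets.

```python
def simulate_gamble_with_length(start_capital, goal, p_head=0.5):
    """Like simulate_gamble, but also return how many bets were placed."""
    capital, steps = start_capital, 0
    while 0 < capital < goal:
        capital += 1 if random.random() < p_head else -1  # always bet 1
        steps += 1
    return capital >= goal, steps

lengths = [simulate_gamble_with_length(5, 10)[1] for _ in range(10000)]
print(f"Average episode length ≈ {sum(lengths) / len(lengths):.1f} bets")
```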
Prediction: State Value Function
For this gambling game, the state value \(v_{\pi}(s)\) is the probability of reaching the goal from state \(s\) under policy \(\pi\) (because we set the reward to 1 for success and 0 otherwise, the expected return equals the success probability).
We can approximate the entire value function for our policy by simulation:
```python
def estimate_values(policy_func, goal, p_head=0.5, trials=10000):
    """
    Estimate value of each capital state 0..goal for a given policy by simulation.
    policy_func(state) -> bet amount
    Returns a dict of state -> estimated value.
    """
    values = {s: 0.0 for s in range(goal + 1)}
    for s in range(1, goal):  # exclude terminal states 0 and goal
        wins = 0
        for _ in range(trials):
            # simulate from state s
            capital = s
            while 0 < capital < goal:
                bet = policy_func(capital)
                if random.random() < p_head:
                    capital += bet
                else:
                    capital -= bet
            if capital >= goal:
                wins += 1
        values[s] = wins / trials
    values[0] = 0.0
    values[goal] = 1.0
    return values

# Define our always-bet-1 policy function
policy_bet1 = lambda capital: 1

val_estimates = estimate_values(policy_bet1, goal=10, p_head=0.5, trials=2000)
print("Estimated value of each state (always bet 1 policy):")
for s in range(0, 11):
    print(f"v({s}) ≈ {val_estimates[s]:.2f}")
```
Estimated value of each state (always bet 1 policy):
v(0) ≈ 0.00
v(1) ≈ 0.10
v(2) ≈ 0.19
v(3) ≈ 0.31
v(4) ≈ 0.40
v(5) ≈ 0.50
v(6) ≈ 0.58
v(7) ≈ 0.71
v(8) ≈ 0.80
v(9) ≈ 0.91
v(10) ≈ 1.00
You should see the values increase as the capital gets higher (closer to the goal). State 0 has value 0 (game lost) and state 10 has value 1 (goal achieved). For a fair coin, the value increases roughly linearly with the capital, since each \(\$1\) is, on average, one step closer to victory. In fact, for the fair-coin random walk the true value function is \(v_{\pi}(s) = s/10\) in this scenario (starting with \(s\) dollars out of \(\$10\)).
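To see why, note that under the bet-$1 policy with a fair coin, each non-terminal state’s value is the average of its two neighbours:

\[
v_{\pi}(s) = \tfrac{1}{2}\,v_{\pi}(s+1) + \tfrac{1}{2}\,v_{\pi}(s-1), \qquad v_{\pi}(0) = 0, \quad v_{\pi}(10) = 1.
\]

The only function satisfying this recurrence with those boundary conditions is linear in \(s\), which gives \(v_{\pi}(s) = s/10\). The Monte Carlo estimates above should match this line up to sampling noise.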
Control: Comparing Different Policies
Now we have a way to evaluate a given policy (by simulation). The next step is control: finding a better policy. Let’s compare two policies:

- \(\pi_1\): Always bet 1 (the policy we just evaluated).
- \(\pi_2\): Always go all-in (bet the entire current capital each time).
Intuition:

- \(\pi_2\) is riskier but reaches the goal faster when it wins. For a fair coin, any bet size yields the same success probability in theory (for \(p = 0.5\), any strategy gives an \(s/Goal\) chance). For a biased coin, however, the optimal strategy changes.
- Let’s test with a biased coin, \(p = 0.4\) (a 40% chance to win each bet, so the odds are against the gambler). In that case, a high-variance strategy (betting big) might increase the chance of success.
```python
def simulate_policy(start_capital, goal, p_head, policy_func):
    capital = start_capital
    while 0 < capital < goal:
        bet = policy_func(capital)
        if random.random() < p_head:
            capital += bet
        else:
            capital -= bet
    return capital >= goal

def estimate_win_prob(start_capital, goal, p_head, policy_func, trials=5000):
    wins = 0
    for _ in range(trials):
        if simulate_policy(start_capital, goal, p_head, policy_func):
            wins += 1
    return wins / trials

# Define policy functions:
policy_all_in = lambda capital: capital  # bet all current capital

# Estimate win probabilities for start=5, goal=10, p=0.4
win_prob_bet1 = estimate_win_prob(5, 10, p_head=0.4, policy_func=policy_bet1, trials=10000)
win_prob_allin = estimate_win_prob(5, 10, p_head=0.4, policy_func=policy_all_in, trials=10000)
print(f"Policy π1 (bet 1) win probability ≈ {win_prob_bet1:.3f}")
print(f"Policy π2 (all-in) win probability ≈ {win_prob_allin:.3f}")
```
Policy π1 (bet 1) win probability ≈ 0.116
Policy π2 (all-in) win probability ≈ 0.392
It’s likely that the all-in policy has a higher win probability when \(p < 0.5\). For example, starting with \(\$5\), goal \(\$10\), \(p = 0.4\):

- The bet-1 policy needs to get 5 wins ahead of its losses before falling 5 behind (very low probability when each bet is lost 60% of the time).
- The all-in policy needs a single \(\$5\) win (probability 0.4).

So \(\pi_2\) should show a win probability of roughly 0.4, whereas \(\pi_1\)’s is much lower (the gambler usually busts, because stringing together more wins than losses against unfavorable odds is unlikely).
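As a sanity check, the standard gambler’s-ruin formula gives the exact win probability for a constant \(\$1\) bet, and the all-in case is just a single coin flip. This quick check is an addition of ours, not part of the simulation code above:

```python
# Exact win probabilities for start = $5, goal = $10, p = 0.4 (q = 0.6).
# Gambler's ruin with unit bets: P(reach goal from s) = (1 - (q/p)**s) / (1 - (q/p)**goal)
p, q, s, goal = 0.4, 0.6, 5, 10
exact_bet1 = (1 - (q / p) ** s) / (1 - (q / p) ** goal)
exact_allin = p  # all-in from $5 reaches $10 with exactly one win
print(f"Exact bet-1 win probability:  {exact_bet1:.3f}")   # ≈ 0.116
print(f"Exact all-in win probability: {exact_allin:.3f}")  # 0.400
```

Both numbers line up closely with the Monte Carlo estimates printed above.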
Result Discussion: The better policy is the one with higher win probability (or higher expected reward). In this case, all-in seems better for p=0.4. If p were 0.6 (favorable odds), the opposite would be true: betting small repeatedly would yield a higher overall chance to eventually accumulate wins (because the odds are in your favor, you want many small bets to reduce variance).
This simple comparison illustrates policy improvement: if we find a policy that yields higher value for the starting state, it’s a better policy. In reinforcement learning, we often iteratively improve a policy by looking at such comparisons.
Toward Optimal Policy
What is the optimal policy for the gambler? We won’t derive it fully here (that’s an exercise for Dynamic Programming later), but intuitively:

- If the coin is against you (\(p < 0.5\)), the optimal strategy is to make large, risky bets (high variance) to maximize the chance of a lucky streak.
- If the coin is in your favor (\(p > 0.5\)), the optimal strategy is to bet in small increments (low variance) and steadily accumulate wins.
- If \(p = 0.5\) exactly, many strategies yield the same success probability (\(s/Goal\), i.e. 50% when starting at half the goal), so the “optimal” policy is not unique.

The quick sweep below illustrates this pattern by reusing the simulation helpers from above.
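This is only a rough Monte Carlo sketch, not the dynamic-programming solution. It reuses estimate_win_prob from earlier and sweeps a few constant bet sizes; capping each bet at the current capital and at the amount still needed is a choice we make for this sketch.

```python
# Approximate win probabilities for a few constant-bet policies, start = $5, goal = $10.
for p_head in (0.4, 0.5, 0.6):
    print(f"p = {p_head}:")
    for bet_size in (1, 2, 5):
        # Cap the bet at the current capital and at the amount still needed.
        policy = lambda capital, b=bet_size: min(b, capital, 10 - capital)
        prob = estimate_win_prob(5, 10, p_head=p_head, policy_func=policy, trials=20000)
        print(f"  constant bet ${bet_size}: win probability ≈ {prob:.3f}")
```

With p = 0.4 the larger bets should come out ahead, with p = 0.6 the $1 bets should win out, and with p = 0.5 all three should hover around 0.5.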
Next, we’ll explore a different classic RL problem: the multi-armed bandit, which focuses on the exploration-exploitation trade-off in a simpler setting (no multiple states, just choosing actions for immediate reward).