Now that we’ve covered basic probability, let’s dive into a simple reinforcement learning scenario: a gambling game. This example will introduce core RL ideas of prediction (evaluating how good a given strategy is) and control (finding a better strategy). We will define a simple gambling environment and simulate an agent’s decisions within it.
Problem Setup: A Gambler’s Game
Imagine a gambler who starts with a certain amount of money and aims to reach a target amount. At each turn, the gambler can bet some of their money on a coin flip:

- If the coin comes up Heads (win), the gambler wins the amount bet (it is added to their capital).
- If Tails (lose), the gambler loses the amount bet (it is subtracted from their capital).
- The coin may be biased, with probability \(p\) of Heads.
The game ends either when the gambler reaches the target (goal) or loses all money (bust).
This is known as the Gambler’s Problem from Sutton and Barto’s RL textbook.

- States: The gambler’s capital (from \(0\) to \(Goal\)).
- Actions: The bet amount (from \(1\) up to the smaller of the current capital and the amount still needed to reach the goal).
- Transition: Win or lose the bet based on the coin flip.
- Reward: +1 for reaching the goal, and 0 otherwise (no reward at each step, only at the end). This way, maximizing total reward is equivalent to maximizing the probability of reaching the goal.

A minimal code sketch of these dynamics follows.
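The sketch is only illustrative; the helper names `valid_actions` and `step` are ours (not from the textbook or any library), and the simulations later in this section inline the same logic.

```python
import random

def valid_actions(capital, goal):
    """Legal bets: from 1 up to min(current capital, amount still needed)."""
    return range(1, min(capital, goal - capital) + 1)

def step(capital, bet, goal, p_head=0.5):
    """One coin flip. Returns (next_capital, reward, done).
    Assumes the bet came from valid_actions, so the capital never overshoots the goal."""
    next_capital = capital + bet if random.random() < p_head else capital - bet
    done = (next_capital == 0) or (next_capital == goal)
    reward = 1 if next_capital == goal else 0  # +1 only when the goal is reached
    return next_capital, reward, done

print(list(valid_actions(3, 10)))  # [1, 2, 3]
print(list(valid_actions(8, 10)))  # [1, 2]
```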
For simplicity, let’s implement a smaller version: goal = $10, starting capital = $5, and a fair coin (p = 0.5). The gambler can bet $1 each turn (to keep it simple initially).
Simulation of a Fixed Policy
Policy: always bet $1 until the game ends. This is the simple fixed strategy we will evaluate first.
We will simulate many episodes (games) with this policy to estimate:

- The probability of reaching the goal (this is the value of the starting state under this policy).
- The distribution of episode lengths (how many bets until the game finishes).
Let’s write a simulation for one episode and then loop it for many episodes.
```python
import random

def simulate_gamble(start_capital, goal, p_head=0.5):
    """
    Simulate one episode of the gambling game.
    Returns True if goal reached (win), False if gambler goes broke (lose).
    Policy: always bet 1.
    """
    capital = start_capital
    while 0 < capital < goal:
        # Always bet 1
        bet = 1
        if random.random() < p_head:  # win
            capital += bet
        else:  # lose
            capital -= bet
        # No intermediate rewards in this formulation
    # Loop ends when capital == 0 or capital == goal
    return capital >= goal  # True if reached goal, False if busted

# Test one episode
result = simulate_gamble(5, 10, p_head=0.5)
print("Episode result:", "Reached goal!" if result else "Broke :(")
```
Episode result: Broke :(
Run the above cell a few times to see different outcomes (sometimes the gambler wins, sometimes loses).
Now, let’s estimate the probability of winning from the starting capital of 5 with this policy by simulating many episodes:
```python
trials = 10000
wins = 0
for _ in range(trials):
    if simulate_gamble(5, 10, p_head=0.5):
        wins += 1
win_prob = wins / trials
print(f"Estimated probability of reaching $10 starting from $5 (bet $1 policy) = {win_prob:.3f}")
```
Estimated probability of reaching $10 starting from $5 (bet $1 policy) = 0.498
With a fair coin and symmetric game, you might expect ~50% chance to reach the goal when starting at half the goal (in fact, for an unbiased random walk, the probability of hitting +5 before -5 starting at 0 is 50%). Our simulation likely shows close to 0.5.
This probability is essentially the value of the state “capital=5” under the current policy (always bet 1). In prediction terms, we’ve estimated \(v_{\pi}(5)\) for our policy \(\pi\) (always bet 1) as ~0.5.
We can similarly estimate \(v_{\pi}(s)\) for other starting states s by changing start_capital.
Exercise: Try simulating from start_capital=1 and start_capital=9 to see the win probabilities. (We expect \(v_{\pi}(1)\) to be much lower than \(v_{\pi}(9)\), since with only \(\$1\) the gambler is one loss from ruin, whereas with \(\$9\) they’re one step from the goal.)
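We also said we would look at episode lengths. Here is one way to estimate them, using a small variant of simulate_gamble that additionally counts the bets; the helper name simulate_gamble_with_length is ours, not something defined above. For the fair-coin, bet-$1 case starting at $5, the classic random-walk duration formula \(s(Goal - s)\) suggests an average of about 25 bets.

```python
def simulate_gamble_with_length(start_capital, goal, p_head=0.5):
    """Like simulate_gamble, but also return how many bets were placed."""
    capital, steps = start_capital, 0
    while 0 < capital < goal:
        capital += 1 if random.random() < p_head else -1  # always bet 1
        steps += 1
    return capital >= goal, steps

lengths = [simulate_gamble_with_length(5, 10)[1] for _ in range(10000)]
print(f"Average episode length ≈ {sum(lengths) / len(lengths):.1f} bets")
```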
Prediction: State Value Function
For this gambling game, the state value \(v_{\pi}(s)\) is the probability of reaching the goal from state \(s\) under policy \(\pi\) (because we set the reward to 1 for success and 0 otherwise, the expected return equals the success probability).
We can approximate the entire value function for our policy by simulation:
```python
def estimate_values(policy_func, goal, p_head=0.5, trials=10000):
    """
    Estimate value of each capital state 0..goal for a given policy by simulation.
    policy_func(state) -> bet amount
    Returns a dict of state -> estimated value.
    """
    values = {s: 0.0 for s in range(goal + 1)}
    for s in range(1, goal):  # exclude terminal states 0 and goal
        wins = 0
        for _ in range(trials):
            # simulate from state s
            capital = s
            while 0 < capital < goal:
                bet = policy_func(capital)
                if random.random() < p_head:
                    capital += bet
                else:
                    capital -= bet
            if capital >= goal:
                wins += 1
        values[s] = wins / trials
    values[0] = 0.0
    values[goal] = 1.0
    return values

# Define our always-bet-1 policy function
policy_bet1 = lambda capital: 1

val_estimates = estimate_values(policy_bet1, goal=10, p_head=0.5, trials=2000)
print("Estimated value of each state (always bet 1 policy):")
for s in range(0, 11):
    print(f"v({s}) ≈ {val_estimates[s]:.2f}")
```
Estimated value of each state (always bet 1 policy):
v(0) ≈ 0.00
v(1) ≈ 0.10
v(2) ≈ 0.19
v(3) ≈ 0.31
v(4) ≈ 0.40
v(5) ≈ 0.50
v(6) ≈ 0.58
v(7) ≈ 0.71
v(8) ≈ 0.80
v(9) ≈ 0.91
v(10) ≈ 1.00
You should see the values increase as the capital gets higher (closer to the goal). State 0 has value 0 (game lost) and state 10 has value 1 (goal achieved). For a fair coin, the value increases roughly linearly with the capital, since each \(\$1\) is, on average, one step closer to victory. In fact, for the fair-coin random walk the true value function is \(v_{\pi}(s) = s/10\) in this scenario (starting with \(s\) dollars out of \(\$10\)).
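To see why, note that under the bet-$1 policy with a fair coin, each non-terminal state’s value is the average of its two neighbours:

\[
v_{\pi}(s) = \tfrac{1}{2}\,v_{\pi}(s+1) + \tfrac{1}{2}\,v_{\pi}(s-1), \qquad v_{\pi}(0) = 0, \quad v_{\pi}(10) = 1.
\]

The only function satisfying this recurrence with those boundary conditions is linear in \(s\), which gives \(v_{\pi}(s) = s/10\). The Monte Carlo estimates above should match this line up to sampling noise.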
Control: Comparing Different Policies
Now we have a way to evaluate a given policy (by simulation). The next step is control: finding a better policy. Let’s compare two policies:

- \(\pi_1\): Always bet 1 (the policy we just evaluated).
- \(\pi_2\): Always go all-in (bet the entire current capital each time).
Intuition:

- \(\pi_2\) is riskier but reaches the goal faster when it wins. For a fair coin, any bet size yields the same success probability in theory (for \(p = 0.5\), any strategy gives an \(s/Goal\) chance). For a biased coin, however, the optimal strategy changes.
- Let’s test with a biased coin, \(p = 0.4\) (a 40% chance to win each bet, so the odds are against the gambler). In that case, a high-variance strategy (betting big) might increase the chance of success.
```python
def simulate_policy(start_capital, goal, p_head, policy_func):
    capital = start_capital
    while 0 < capital < goal:
        bet = policy_func(capital)
        if random.random() < p_head:
            capital += bet
        else:
            capital -= bet
    return capital >= goal

def estimate_win_prob(start_capital, goal, p_head, policy_func, trials=5000):
    wins = 0
    for _ in range(trials):
        if simulate_policy(start_capital, goal, p_head, policy_func):
            wins += 1
    return wins / trials

# Define policy functions:
policy_all_in = lambda capital: capital  # bet all current capital

# Estimate win probabilities for start=5, goal=10, p=0.4
win_prob_bet1 = estimate_win_prob(5, 10, p_head=0.4, policy_func=policy_bet1, trials=10000)
win_prob_allin = estimate_win_prob(5, 10, p_head=0.4, policy_func=policy_all_in, trials=10000)
print(f"Policy π1 (bet 1) win probability ≈ {win_prob_bet1:.3f}")
print(f"Policy π2 (all-in) win probability ≈ {win_prob_allin:.3f}")
```
Policy π1 (bet 1) win probability ≈ 0.116
Policy π2 (all-in) win probability ≈ 0.392
It’s likely that the all-in policy has a higher win probability when \(p < 0.5\). For example, starting with \(\$5\), goal \(\$10\), \(p = 0.4\):

- The bet-1 policy needs to get 5 wins ahead of its losses before falling 5 behind (very low probability when each bet is lost 60% of the time).
- The all-in policy needs a single \(\$5\) win (probability 0.4).

So \(\pi_2\) should show a win probability of roughly 0.4, whereas \(\pi_1\)’s is much lower (the gambler usually busts, because stringing together more wins than losses against unfavorable odds is unlikely).
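As a sanity check, the standard gambler’s-ruin formula gives the exact win probability for a constant \(\$1\) bet, and the all-in case is just a single coin flip. This quick check is an addition of ours, not part of the simulation code above:

```python
# Exact win probabilities for start = $5, goal = $10, p = 0.4 (q = 0.6).
# Gambler's ruin with unit bets: P(reach goal from s) = (1 - (q/p)**s) / (1 - (q/p)**goal)
p, q, s, goal = 0.4, 0.6, 5, 10
exact_bet1 = (1 - (q / p) ** s) / (1 - (q / p) ** goal)
exact_allin = p  # all-in from $5 reaches $10 with exactly one win
print(f"Exact bet-1 win probability:  {exact_bet1:.3f}")   # ≈ 0.116
print(f"Exact all-in win probability: {exact_allin:.3f}")  # 0.400
```

Both numbers line up closely with the Monte Carlo estimates printed above.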
Result Discussion: The better policy is the one with higher win probability (or higher expected reward). In this case, all-in seems better for p=0.4. If p were 0.6 (favorable odds), the opposite would be true: betting small repeatedly would yield a higher overall chance to eventually accumulate wins (because the odds are in your favor, you want many small bets to reduce variance).
This simple comparison illustrates policy improvement: if we find a policy that yields higher value for the starting state, it’s a better policy. In reinforcement learning, we often iteratively improve a policy by looking at such comparisons.
Toward Optimal Policy
What is the optimal policy for the gambler? We won’t derive it fully here (that’s an exercise for Dynamic Programming later), but intuitively:

- If the coin is against you (\(p < 0.5\)), the optimal strategy is to make large, risky bets (high variance) to maximize the chance of a lucky streak.
- If the coin is in your favor (\(p > 0.5\)), the optimal strategy is to bet in small increments (low variance) and steadily accumulate wins.
- If \(p = 0.5\) exactly, many strategies yield the same success probability (\(s/Goal\), i.e. 50% when starting at half the goal), so the “optimal” policy is not unique.

The quick sweep below illustrates this pattern by reusing the simulation helpers from above.
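This is only a rough Monte Carlo sketch, not the dynamic-programming solution. It reuses estimate_win_prob from earlier and sweeps a few constant bet sizes; capping each bet at the current capital and at the amount still needed is a choice we make for this sketch.

```python
# Approximate win probabilities for a few constant-bet policies, start = $5, goal = $10.
for p_head in (0.4, 0.5, 0.6):
    print(f"p = {p_head}:")
    for bet_size in (1, 2, 5):
        # Cap the bet at the current capital and at the amount still needed.
        policy = lambda capital, b=bet_size: min(b, capital, 10 - capital)
        prob = estimate_win_prob(5, 10, p_head=p_head, policy_func=policy, trials=20000)
        print(f"  constant bet ${bet_size}: win probability ≈ {prob:.3f}")
```

With p = 0.4 the larger bets should come out ahead, with p = 0.6 the $1 bets should win out, and with p = 0.5 all three should hover around 0.5.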
Next, we’ll explore a different classic RL problem: the multi-armed bandit, which focuses on the exploration-exploitation trade-off in a simpler setting (no multiple states, just choosing actions for immediate reward).