Part 2: Multi-Armed Bandits

Multi-armed bandits (MABs) are a simpler class of problems that capture the fundamental challenge of exploration vs. exploitation in reinforcement learning. The scenario is like a gambler in front of several slot machines (“one-armed bandits”): each pull of a lever gives a stochastic reward. The gambler (agent) wants to maximize reward over time by choosing which machines to play, balancing trying new machines (exploration) and sticking with the best one found so far (exploitation).

Problem Setup

You have K slot machines (arms). Each arm, when pulled, gives a reward drawn from a fixed distribution (unknown to you). For simplicity, let’s say each arm gives a reward of 1 with some probability (and 0 otherwise) – a Bernoulli bandit.
Your goal is to maximize the total reward over a series of pulls (trials).

Key challenge: Exploration vs. Exploitation - Exploration: Try different arms to gain information about their payoff. - Exploitation: Use the information to pick the arm with the highest known reward rate so far.

We will implement and compare three strategies: 1. \(\epsilon\)-Greedy – with probability ε, explore (random arm), otherwise exploit (best arm) 2. Upper Confidence Bound (UCB) – select the arm with the highest upper confidence bound on reward (favoring arms with less trials so far to explore uncertainty 3. Thompson Sampling – a Bayesian approach: maintain a distribution for each arm’s success probability and sample from these to decide (naturally balancing exploration and exploitation)

Let’s set up a bandit problem to test these. We will create a bandit with, say, 3 arms: - Arm 0: win probability 0.3 - Arm 1: win probability 0.5 - Arm 2: win probability 0.7

The best arm is #2 (70% chance of reward), but the agents don’t know that initially.

import random

# True probabilities for each arm (unknown to agent)
true_probs = [0.3, 0.5, 0.7]
K = len(true_probs)

We’ll simulate a sequence of 1000 pulls. At each step, the agent chooses an arm according to the strategy, and gets a reward 1 (with the arm’s true probability) or 0. We will track: - The cumulative reward over time. - The number of pulls of each arm (to see exploration).

ε-Greedy Strategy

The ε-greedy strategy is simple and widely used: - We maintain an estimate of each arm’s value (e.g., average reward observed). - On each step, with probability ε choose a random arm (explore), otherwise choose the arm with highest estimated value (exploit). - After getting the reward, update the arm’s value estimate (incremental average).

Let’s implement ε-greedy:

def run_epsilon_greedy(true_probs, epsilon=0.1, trials=1000):
    # Initialize estimates and counts
    counts = [0] * K # number of times each arm is sampled
    values = [0.0] * K  # estimated value (mean reward) for each arm
    rewards_history = []
    total_reward = 0

    for t in range(1, trials+1):
        # Decide arm
        if random.random() < epsilon:
            arm = random.randrange(K)  # explore
        else:
            # exploit (argmax of values, tie-break randomly)
            max_val = max(values)
            best_arms = [i for i, v in enumerate(values) if v == max_val]
            arm = random.choice(best_arms)
        # Pull arm
        reward = 1 if random.random() < true_probs[arm] else 0
        # Update counts and value estimate
        counts[arm] += 1
        # Incremental average update for mean:
        values[arm] += (reward - values[arm]) / counts[arm]
        # Track total reward
        total_reward += reward
        rewards_history.append(total_reward)
    return rewards_history, counts, values

# Run epsilon-greedy
history_eps, counts_eps, values_eps = run_epsilon_greedy(true_probs, epsilon=0.1, trials=90000)
print("Epsilon-Greedy results:")
print("Arm counts:", counts_eps)
print("Estimated values:", [f"{v:.2f}" for v in values_eps])
print("Total reward:", history_eps[-1])

Epsilon-Greedy results:
Arm counts: [3024, 3220, 83756]
Estimated values: ['0.30', '0.50', '0.70']
Total reward: 61325

Question: What happens if ε is too high or too low? Feel free to try different ε values in run_epsilon_greedy to see the effect on total reward.

Upper Confidence Bound (UCB) Strategy

UCB is a more advanced strategy that addresses a limitation of ε-greedy: ε-greedy explores blindly. UCB uses a confidence interval approach: it picks the arm with the highest upper confidence bound on the estimated reward.One common formula (UCB1) for each arm a at time t is: \[ \text{UCB}_a = \hat{Q}_a + \sqrt{\frac{2 \ln t}{N_a}} \] where: - \(\hat{Q}_a\) is the current estimated value (mean reward) of arm a. - \(N_a\) is how many times arm a has been pulled. - \(N_a\) is how many times arm a has been pulled.

We will implement UCB1. Note: We need to pull each arm at least once initially to get an estimate (and avoid division by zero).

import math

def run_ucb(trials=1000):
    counts = [0] * K
    values = [0.0] * K 
    rewards_history = []
    total_reward = 0

    # Initial phase: play each arm once
    for arm in range(K):
        reward = 1 if random.random() < true_probs[arm] else 0
        counts[arm] += 1
        values[arm] += reward  # single sample mean is the reward itself
        total_reward += reward
        rewards_history.append(total_reward)
    # Now do the remaining trials
    for t in range(K+1, trials+1):
        # Compute UCB for each arm
        ucb_values = []
        for arm in range(K):
            # Exploitation term = values[arm]
            # Exploration term = sqrt(2 * ln(t) / counts[arm])
            exploration_bonus = math.sqrt(2 * math.log(t) / counts[arm])
            ucb_values.append(values[arm] + exploration_bonus)
        # Select arm with highest UCB
        arm = ucb_values.index(max(ucb_values))
        # Pull arm
        reward = 1 if random.random() < true_probs[arm] else 0
        counts[arm] += 1
        # Update mean estimate
        values[arm] += (reward - values[arm]) / counts[arm] # incremental mean update
        total_reward += reward
        rewards_history.append(total_reward)
    return rewards_history, counts, values

history_ucb, counts_ucb, values_ucb = run_ucb(trials=1000)
print("\nUCB results:")
print("Arm counts:", counts_ucb)
print("Estimated values:", [f"{v:.2f}" for v in values_ucb])
print("Total reward:", history_ucb[-1])


UCB results:
Arm counts: [71, 136, 793]
Estimated values: ['0.37', '0.49', '0.68']
Total reward: 634

The formula values[arm] += (reward - values[arm]) / counts[arm] keeps a running average in the UCB algorithm. If you have an old average m over n samples, and you add a new value x, the new average is (m * n + x) / (n + 1). This pattern pops up elsewhere—reinforcement learning, online learning, even basic statistics.

3 Thompson Sampling

Thompson Sampling is a Bayesian approach: - Assume a prior distribution for each arm’s success probability. For a Bernoulli reward, a common prior is Beta(1,1) (uniform). - After each pull, update the posterior for that arm (Beta prior updated with observed successes/failures). - To choose an arm: sample a success probability from each arm’s current posterior, then pick the arm with the highest sample. This way, arms with more uncertainty or higher potential can win the selection sometimes, naturally balancing exploration and exploitation.

We’ll implement Thompson Sampling using Beta distributions for each arm. Python’s random.betavariate(alpha, beta) can sample from a Beta distribution.

def run_thompson(trials=10000):
    # Start with prior Beta(1,1) for each arm
    alpha = [1] * K
    beta = [1] * K
    counts = [0] * K
    values = [0.0] * K  # can track mean as successes / trials for info
    total_reward = 0
    rewards_history = []

    for t in range(1, trials+1):
        # Sample a probability for each arm from its Beta(alpha, beta)
        sampled_probs = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        arm = sampled_probs.index(max(sampled_probs))
        # Pull the chosen arm
        reward = 1 if random.random() < true_probs[arm] else 0
        counts[arm] += 1
        # Update alpha, beta (posterior update: alpha += reward, beta += (1-reward))
        alpha[arm] += reward
        beta[arm] += (1 - reward)
        # Update value estimate for tracking
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
        rewards_history.append(total_reward)
    return rewards_history, counts, values

history_th, counts_th, values_th = run_thompson(trials=1000)
print("\nThompson Sampling results:")
print("Arm counts:", counts_th)
print("Estimated values:", [f"{v:.2f}" for v in values_th])
print("Total reward:", history_th[-1])


Thompson Sampling results:
Arm counts: [6, 27, 967]
Estimated values: ['0.00', '0.41', '0.72']
Total reward: 704

avg_reward_eps = history_eps[-1] / len(history_eps)
avg_reward_ucb = history_ucb[-1] / len(history_ucb)
avg_reward_th = history_th[-1] / len(history_th)
print("\nAverage reward per pull:")
print(f"Epsilon-Greedy: {avg_reward_eps:.3f}")
print(f"UCB: {avg_reward_ucb:.3f}")
print(f"Thompson Sampling: {avg_reward_th:.3f}")


Average reward per pull:
Epsilon-Greedy: 0.681
UCB: 0.634
Thompson Sampling: 0.704

ε-Greedy: Simple, ensures some exploration. But it doesn’t reduce exploration over time (unless ε is decayed), so it might waste some pulls exploring even when fairly confident
UCB: Uses a principled approach to exploration. It explores less over time as it gains confidence, focusing on uncertain arms until confidence bounds narrow
Thompson Sampling: A Bayesian method that often performs as well or better by naturally balancing exploration according to uncertainty. It’s a bit more complex (requires sampling and thinking in terms of distributions), but very powerful.

All these methods are important in reinforcement learning and online decision-making. They illustrate the core exploration-exploitation trade-off in a simple setting (no state transitions, just repeated actions). In a full RL problem, similar concepts apply when an agent tries actions in an environment and must explore new actions while exploiting what it knows.Next, we will move to Dynamic Programming for RL, where we assume we know the environment dynamics and see how to compute optimal policies by planning (instead of learning from data).

← Part 1 | All labs | Part 3 →