Fundamentals of Exploration and Exploitation

In reinforcement learning, exploration and exploitation represent a fundamental trade-off: exploration involves choosing actions that are currently uncertain but may yield higher rewards in the future, while exploitation involves selecting actions that are currently believed to be optimal. Balancing these two is a key challenge in reinforcement learning.

Understanding Regret

In the multi-armed bandit problem, the action value $Q(a) = \mathbb{E}[r \mid a]$ is defined as the expected reward for taking action $a$. The optimal value of the environment is then $V^* = Q(a^*) = \max_{a} Q(a)$. Based on this, the cumulative regret over the first $T$ steps is defined as:

$$ L_T = \mathbb{E}\left[\sum_{t=1}^T \left(V^* - Q(a_t)\right)\right] $$

Note that the optimal value is generally unknown and serves as a theoretical reference. In simulations we can "know in advance" which arm is optimal, for example by setting one arm's reward distribution to $\mathcal{N}(0.8, 0.1^2)$ while the others have lower means. In practical applications, regret is approximated rather than measured exactly, for example via A/B testing, historical data, or pseudo-regret (where the best-performing arm in hindsight substitutes for the optimal one).

Cumulative regret grows with $T$, and the goal of a bandit algorithm is to make it grow as slowly as possible (ideally sublinearly, e.g. logarithmically). The core challenge is balancing exploration and exploitation so that the best arm is identified as early as possible.
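
As a minimal sketch of how cumulative regret is measured in simulation (the arm means below are illustrative assumptions, not values from the text), one can record the arms a policy selects and accumulate the per-step gaps:

import numpy as np

# Hypothetical arm means; in simulation the optimal value V* is known.
true_means = np.array([0.2, 0.5, 0.8])
v_star = true_means.max()

def cumulative_regret(chosen_arms):
    # chosen_arms: indices a_1, ..., a_T selected by some policy
    gaps = v_star - true_means[np.array(chosen_arms)]
    return np.cumsum(gaps)  # L_t for t = 1, ..., T

# A policy that always pulls arm 1 incurs regret growing linearly (0.3 per step).
print(cumulative_regret([1] * 10))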

Classic Exploration Strategies

| Algorithm | Exploration Strategy | Regret Upper Bound |
| --- | --- | --- |
| $\epsilon$-Greedy | Random exploration with probability $\epsilon$, otherwise greedy | $O(T)$ (if $\epsilon$ is fixed) |
| UCB (Upper Confidence Bound) | Selects the arm with the highest upper confidence bound | $O(\log T)$ |
| Thompson Sampling | Bayesian exploration via posterior sampling | $O(\log T)$ |

$\epsilon$-Greedy

With probability $1-\epsilon$ it selects the arm with the highest estimated average reward, and with probability $\epsilon$ it picks an arm uniformly at random. In practice, $\epsilon$ can start large and be decayed as the reward estimates stabilize (see the sketch below).
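
A minimal sketch of $\epsilon$-greedy for Bernoulli arms (the arm means and the fixed $\epsilon$ below are illustrative assumptions):

import numpy as np

def epsilon_greedy(K, T, true_means, eps=0.1):
    n = np.zeros(K)    # pull counts
    mu = np.zeros(K)   # empirical mean rewards
    for t in range(T):
        if np.random.rand() < eps:
            i = np.random.randint(K)      # explore: uniformly random arm
        else:
            i = int(np.argmax(mu))        # exploit: current best estimate
        r = np.random.binomial(1, true_means[i])
        n[i] += 1
        mu[i] += (r - mu[i]) / n[i]       # incremental mean update
    return mu, n

# eps can also be decayed over time, e.g. eps_t = max(0.01, 1.0 / (t + 1))
mu, n = epsilon_greedy(K=5, T=1000, true_means=[0.1, 0.3, 0.5, 0.7, 0.9])
print("Estimated means:", mu)
print("Number of pulls:", n)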

UCB

A drawback of $\epsilon$-greedy is that it explores blindly: even when sufficient samples show that one arm is clearly better, it keeps picking suboptimal arms with the same probability. UCB addresses this issue.

Core Idea

UCB follows the principle of optimism in the face of uncertainty:

  • Exploitation: Choose arms with high current average rewards.
  • Exploration: Prefer arms with fewer trials, as their reward estimates are less certain and may have higher potential.

UCB quantifies this uncertainty using confidence intervals and selects the arm with the highest upper confidence bound (UCB) at each step.

Mathematical Formulation

For $K$ arms, at time $t$, each arm $i$ has:

  • $n_i(t)$: Number of times arm $i$ has been selected.
  • $\hat{\mu}_i(t)$: Empirical average reward of arm $i$.

UCB selects arm $I_t$ as:

$$ I_t = \arg\max_{i} \left( \hat{\mu}_i(t) + \sqrt{\frac{2 \ln t}{n_i(t)}} \right) $$

Here:

  • $\hat{\mu}_i(t)$ represents exploitation.
  • $\sqrt{\frac{2 \ln t}{n_i(t)}}$ is the confidence radius, representing exploration. As $n_i(t)$ increases, this term shrinks, reflecting higher certainty.
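
For concreteness, here is an illustrative calculation (the numbers are hypothetical, not from a specific run). At $t = 100$, an arm with empirical mean $\hat{\mu}_i = 0.6$ that has been pulled $n_i = 10$ times has index

$$ \hat{\mu}_i + \sqrt{\frac{2 \ln 100}{10}} \approx 0.6 + 0.96 = 1.56, $$

while after $n_i = 100$ pulls the bonus shrinks to $\sqrt{2 \ln 100 / 100} \approx 0.30$, so the index drops to about $0.90$ unless the empirical mean itself rises.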

Derivation

UCB is based on Hoeffding's inequality, which bounds how far the empirical mean of bounded random variables can deviate from the true mean. Assuming rewards $X_i \in [0,1]$, the inequality gives, for any $\epsilon > 0$:

$$ P\left( \mu_i \geq \hat{\mu}_i(t) + \epsilon \right) \leq e^{-2 n_i(t) \epsilon^2} $$

Setting the failure probability to $\delta = t^{-4}$ (chosen so that, summed over all arms and time steps, the probability that the bound ever fails remains small) and solving for $\epsilon$:

$$ e^{-2 n_i(t) \epsilon^2} = t^{-4} \implies \epsilon = \sqrt{\frac{2 \ln t}{n_i(t)}} $$

Thus, with high probability:

$$ \mu_i \leq \hat{\mu}_i(t) + \sqrt{\frac{2 \ln t}{n_i(t)}} $$

UCB selects the arm with the highest upper bound, i.e., the most optimistic estimate.

Steps

  1. Initialization: Pull each arm once, update $n_i$ and $\hat{\mu}_i$.
  2. Loop: For each $t = K+1, K+2, \dots, T$:
    • Compute UCB for each arm:
      $$ \text{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}} $$
    • Select the arm $I_t$ with the highest UCB.
    • Observe reward $r_t$.
    • Update statistics:
      • $n_{I_t} \leftarrow n_{I_t} + 1$
      • $\hat{\mu}_{I_t} \leftarrow \hat{\mu}_{I_t} + \frac{r_t - \hat{\mu}_{I_t}}{n_{I_t}}$

Performance

UCB’s regret grows logarithmically:

$$ R(T) \leq O\left( \sum_{i: \mu_i < \mu^*} \frac{\ln T}{\Delta_i} \right) + O(K) $$

where $\Delta_i = \mu^* - \mu_i$ is the gap between arm $i$ and the optimal arm.
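
As a rough sense of scale (a hypothetical two-arm instance, not taken from the text above): with gap $\Delta = 0.1$ and horizon $T = 10^4$,

$$ \frac{\ln T}{\Delta} = \frac{\ln 10^4}{0.1} \approx \frac{9.21}{0.1} \approx 92, $$

so the regret attributable to the suboptimal arm grows like a constant times $92$, far below the $O(T)$ regret of a fixed-$\epsilon$ strategy.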

Python Implementation

import numpy as np

def ucb(K, T, rewards):
    # n[i]: number of pulls of arm i; mu[i]: empirical mean reward of arm i
    n = np.zeros(K)
    mu = np.zeros(K)

    # Initial phase: pull each arm once
    for i in range(K):
        reward = rewards(i, n[i])
        mu[i] = reward
        n[i] += 1

    # Main loop: pick the arm with the highest upper confidence bound
    for t in range(K, T):
        # After the initial phase t >= K, so log(t) > 0
        ucb_values = mu + np.sqrt(2 * np.log(t) / n)
        i = np.argmax(ucb_values)
        reward = rewards(i, n[i])
        mu[i] = (mu[i] * n[i] + reward) / (n[i] + 1)  # incremental mean update
        n[i] += 1

    return mu, n

# Example: 5 Bernoulli arms
K = 5
T = 1000
true_means = [0.1, 0.3, 0.5, 0.7, 0.9]
def rewards(i, t):
    return np.random.binomial(1, true_means[i])

mu, n = ucb(K, T, rewards)
print("Estimated means:", mu)
print("Number of pulls:", n)

Thompson Sampling

Thompson Sampling (TS) is a Bayesian bandit algorithm that selects arms by sampling from the posterior distribution over each arm's mean reward. It is particularly natural for Bernoulli rewards (e.g., click-through rates), where the Beta distribution is a conjugate prior.

Core Idea

  1. Maintain a posterior distribution (e.g., Beta) over each arm's mean reward.
  2. Sample a value from each distribution and select the arm with the highest sample.
  3. Update the distribution based on observed rewards.

This naturally balances:

  • Exploitation: Arms with higher estimated rewards are chosen more often.
  • Exploration: Arms with high uncertainty (wide distributions) still get sampled.

Mathematical Formulation

For Bernoulli rewards $r_i \sim \text{Bernoulli}(\mu_i)$, TS uses a Beta prior:

$$ \mu_i \sim \text{Beta}(\alpha_i, \beta_i) $$
  • $\alpha_i$: Success count for arm $i$.
  • $\beta_i$: Failure count for arm $i$.

Algorithm:

  1. Initialize: $\alpha_i = 1$, $\beta_i = 1$ for all arms.
  2. Loop: For each $t$:
    • Sample $\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$.
    • Select $I_t = \arg\max_i \theta_i$.
    • Observe $r_t \in \{0,1\}$.
    • Update:
      • If $r_t=1$: $\alpha_{I_t} \leftarrow \alpha_{I_t} + 1$
      • Else: $\beta_{I_t} \leftarrow \beta_{I_t} + 1$

Comparison with UCB

| Aspect | UCB | Thompson Sampling |
| --- | --- | --- |
| Method | Confidence bounds (deterministic) | Posterior sampling (randomized) |
| Exploration | Explicit confidence intervals | Implicit, via posterior uncertainty |
| Complexity | Low (closed-form index) | Requires sampling, but efficient with conjugate priors |
| Use Case | General bounded rewards | Bernoulli/binomial rewards (Beta prior) |
| Guarantees | Logarithmic regret | Logarithmic regret, often better empirically |

Python Implementation

import numpy as np

def thompson_sampling(K, T, true_means):
    # Beta(alpha, beta) posterior for each arm, starting from a uniform Beta(1, 1) prior
    alpha = np.ones(K)
    beta_ = np.ones(K)
    rewards = np.zeros(K)

    for t in range(T):
        # Sample one value per arm from its posterior, play the arm with the largest sample
        theta_samples = np.random.beta(alpha, beta_)
        chosen_arm = np.argmax(theta_samples)
        reward = np.random.binomial(1, true_means[chosen_arm])
        # Posterior update: success increments alpha, failure increments beta
        alpha[chosen_arm] += reward
        beta_[chosen_arm] += (1 - reward)
        rewards[chosen_arm] += reward

    return alpha, beta_, rewards

# Example: 5 Bernoulli arms
K = 5
T = 1000
true_means = [0.1, 0.3, 0.5, 0.7, 0.9]
alpha, beta_, rewards = thompson_sampling(K, T, true_means)

print("Alpha (successes):", alpha)
print("Beta (failures):", beta_)
print("Total rewards:", rewards)
print("Estimated means:", alpha / (alpha + beta_))

