Bandit Exploration Orderings

Input: $ARGUMENTS

Overview

When you have multiple options with uncertain payoffs, you face the explore-exploit tradeoff: try new things (explore) or stick with what works (exploit). Multi-armed bandit algorithms provide principled orderings for this tradeoff.

Core Principle

Early on, explore broadly. As information accumulates, shift toward exploiting the best-known options. The optimal balance depends on how many decisions remain.

Ordering Rules

Rule 1: Epsilon-Greedy — Mostly Exploit, Sometimes Explore

With probability (1-ε): choose the best-known option
With probability ε: choose randomly among all options
Start with high ε (~0.3), decrease over time
When: simple problems, many decisions, low cost of exploration
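
A minimal sketch of Rule 1, assuming each option's estimated value is kept in a plain list. The names epsilon_greedy_choice and decayed_epsilon, and the specific decay schedule, are illustrative assumptions; any schedule that lowers ε over time fits the rule.

```python
import random

def epsilon_greedy_choice(estimates, epsilon, rng=random):
    """Pick an arm index: random with probability epsilon, else the best estimate."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all options.
        return rng.randrange(len(estimates))
    # Exploit: choose the best-known option (ties go to the lowest index).
    return max(range(len(estimates)), key=lambda i: estimates[i])

def decayed_epsilon(step, start=0.3, decay=0.01, floor=0.01):
    """Hypothetical schedule: start at `start`, shrink as 1 / (1 + decay * step)."""
    return max(floor, start / (1.0 + decay * step))
```

With epsilon set to 0 the function always exploits, and with epsilon set to 1 it always explores, which makes the two extremes easy to check.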

Rule 2: Upper Confidence Bound (UCB) — Optimism Under Uncertainty

Score each option: estimated value + uncertainty bonus
Choose the option with highest score
An option's uncertainty bonus shrinks the more that option is sampled
When: want principled exploration, can estimate confidence intervals
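
One common instance of this rule is UCB1, where the bonus is sqrt(2 ln t / n) for an option pulled n times out of t total pulls. The sketch below assumes rewards lie in [0, 1]; the function name ucb1_choice and the parameter c are illustrative, not part of the rule itself.

```python
import math

def ucb1_choice(counts, means, c=2.0):
    """Return the arm index with the highest UCB1 score."""
    # Pull every arm once first so each has a defined estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    # Score = estimated value + uncertainty bonus; the bonus shrinks as n grows.
    scores = [
        mean + math.sqrt(c * math.log(total) / n)
        for mean, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=lambda i: scores[i])
```

A rarely sampled option can outscore one with a higher estimated value, which is the "optimism" in the rule's name.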

Rule 3: Thompson Sampling — Sample from Beliefs

Maintain a probability distribution over each option’s value
Sample from each distribution, choose highest sample
Naturally balances: uncertain options get explored, good options get exploited
When: can maintain Bayesian beliefs, want adaptive exploration
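
A sketch of Rule 3 for the simplest case, assuming binary rewards (success or failure) and a uniform Beta(1, 1) prior on each option's success rate. Both the prior and the name thompson_choice are assumptions made for illustration; other reward models need other belief distributions.

```python
import random

def thompson_choice(successes, failures, rng=random):
    """Sample a success rate per arm from its Beta posterior; pick the highest."""
    samples = [
        # Posterior after s successes and f failures under a Beta(1, 1) prior.
        rng.betavariate(s + 1, f + 1)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=lambda i: samples[i])
```

Options with few observations have wide posteriors, so they occasionally produce the highest sample and get tried; well-observed good options win most draws.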

Rule 4: Successive Elimination — Remove Losers

Give each option minimum samples
Eliminate options whose confidence interval lies entirely below that of the current best
Concentrate remaining budget on survivors
When: want to identify the best option efficiently
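
One way to make "statistically worse" concrete is a Hoeffding confidence radius. The sketch below assumes rewards in [0, 1] and at least one sample per option; the name surviving_arms, the default delta, and the simplified union bound over options only are illustrative choices, not the only valid test.

```python
import math

def surviving_arms(counts, means, delta=0.05):
    """Keep arms whose upper bound reaches the best lower bound."""
    k = len(counts)
    # Hoeffding radius with the failure probability delta split across k arms.
    radius = [math.sqrt(math.log(2 * k / delta) / (2 * n)) for n in counts]
    best_lower = max(m - r for m, r in zip(means, radius))
    # An arm is eliminated once its upper bound falls below best_lower.
    return [i for i in range(k) if means[i] + radius[i] >= best_lower]
```

Calling this after each sampling round and pulling only the returned indices concentrates the remaining budget on the survivors.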

Application Procedure

Step 1: Frame as Bandit Problem

What are the “arms” (options)?
What is the “reward” (payoff)?
How many “pulls” (decisions) do you have?
Is exploration costly? Risky?

Step 2: Choose Strategy

Few decisions remaining → exploit (little time to benefit from exploration)
Many decisions → explore (information has long-term value)
Large spread in payoffs across options → explore more (bigger upside from finding the best)
Low cost of failure → explore freely

Step 3: Execute and Update

Choose option per strategy
Observe outcome
Update beliefs about that option
Repeat with updated beliefs
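
The four steps above can be sketched as a simulation loop. This example assumes simulated binary rewards and uses epsilon-greedy as the strategy purely for illustration; run_bandit, true_means, and the seed are hypothetical names and values, and any of the four rules could replace the selection step.

```python
import random

def run_bandit(true_means, steps, epsilon=0.1, seed=0):
    """Simulate choose / observe / update for a fixed number of pulls."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(steps):
        # Choose option per strategy (epsilon-greedy here).
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda i: estimates[i])
        # Observe outcome: a simulated success/failure reward.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        # Update beliefs about that option only (incremental running mean).
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates
```

Only the pulled option's estimate changes on each step, so options that are never chosen keep their initial estimate; this is why some exploration is needed before exploiting.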

Anti-Patterns

Pure exploitation (never trying new things, so you stay stuck on a local optimum)
Pure exploration (never committing, so known-good options go to waste)
Exploring when you’re out of time to benefit from what you learn

When to Use

A/B testing and optimization
Resource allocation across uncertain options
Career decisions (try new roles vs deepen current)
Any sequential decision with learning

Verification

Options and payoffs identified
Time horizon considered
Exploration budget appropriate for remaining decisions
Beliefs updated after each observation
