By Vitor Sousa · ~25 min read

Implementing Contextual Bandits: Complete Algorithm Guide

Part 3 of 5: From Theory to Code

TL;DR: This post provides complete, production-ready implementations of the essential contextual bandit algorithms: ε-greedy (baseline), UCB/LinUCB (confidence-based), and Thompson Sampling (Bayesian). It includes an algorithm selection guide with default hyperparameters, guidance on when to use each method, and practical tuning strategies.

Reading time: ~25 minutes


Introduction: From Theory to Practice

Parts 1 and 2 covered when to use contextual bandits and the mathematical foundations. Now it’s time to implement.

This post provides complete, working code for the core algorithms you’ll actually deploy:

  1. ε-greedy: Simplest baseline (start here)
  2. UCB/LinUCB: Confidence-based exploration (production workhorse)
  3. Thompson Sampling: Bayesian posterior sampling (often best empirically)

Each algorithm includes:

  • Full Python implementation
  • When to use it
  • Hyperparameter defaults and tuning
  • Strengths and limitations

By the end, you’ll know which algorithm fits your problem and have code ready to run.


Algorithm Selection: Quick Start Guide

Before diving into implementations, use this table to pick your starting point:

| Your Situation | Recommended Algorithm | Default Hyperparameters | Why This Works |
|----------------|-----------------------|--------------------------|----------------|
| Few actions (K < 20), no context | ε-greedy or UCB1 | ε = 0.1 with decay 1/t | Simple, interpretable, proven |
| Few actions, simple context (d < 50) | LinUCB | α = 1.0, λ = 1.0 | Efficient, confidence-based exploration |
| Medium context (d = 50-500), linear | LinUCB or Linear Thompson | α = 1.0, λ = 1.0, σ = 1.0 | Balance of efficiency and exploration |
| High-dim context (d > 500), linear | LinUCB with feature selection | α = 0.5, λ = 10.0 | Regularization prevents overfitting |
| Non-stationary environment | ε-greedy (constant) or Discounted TS (sketched below) | ε = 0.15, γ = 0.9999 | Continuous exploration adapts to drift |
| Need interpretability | LinUCB | α = 1.0, λ = 1.0 | Can examine learned weights |
| Maximum performance | Thompson Sampling | Use defaults from algorithm | Best empirical results |

Start simple: Begin with ε-greedy to validate your setup. Upgrade to LinUCB or Thompson Sampling once you’ve confirmed the basics work.
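
The non-stationary row mentions Discounted Thompson Sampling, which decays old evidence so the posterior can track drift (Thompson Sampling itself is covered in detail below). Here is a minimal sketch for Bernoulli rewards using the γ = 0.9999 default from the table; the class name, the prior floor, and the exact decay scheme are illustrative assumptions rather than a canonical implementation.

```python
import numpy as np


class DiscountedBernoulliTS:
    """Sketch of discounted Thompson Sampling for Bernoulli rewards.

    Each update decays the Beta pseudo-counts by gamma before adding the
    new observation, so stale evidence loses weight over time.
    """

    def __init__(self, n_actions, gamma=0.9999, prior=1.0):
        self.gamma = gamma
        self.prior = prior
        self.alpha = np.full(n_actions, prior)  # discounted successes + prior
        self.beta = np.full(n_actions, prior)   # discounted failures + prior

    def select_action(self, context=None):
        # Sample each arm's Beta posterior and play the arm with the largest draw
        return int(np.argmax(np.random.beta(self.alpha, self.beta)))

    def update(self, action, reward):
        # Decay all pseudo-counts toward the prior, then add the new observation
        self.alpha = self.prior + self.gamma * (self.alpha - self.prior)
        self.beta = self.prior + self.gamma * (self.beta - self.prior)
        self.alpha[action] += reward
        self.beta[action] += 1 - reward
```

Decaying toward the prior rather than toward zero keeps a floor of uncertainty on every arm, so exploration never switches off entirely in a drifting environment.
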


ε-greedy: The Simplest Baseline

ε-greedy is the go-to baseline. It’s easy to implement, easy to explain to stakeholders, and works reasonably well.

Algorithm

At each round t:
1. With probability ε: Choose random action (explore)
2. With probability 1-ε: Choose action with highest estimated reward (exploit)
3. Observe reward, update estimates

Mathematical Formulation

$$a_t = \begin{cases} \arg\max_{a'} \hat{\mu}_{a'}(x) & \text{with probability } 1-\varepsilon \\ \text{uniform}(A) & \text{with probability } \varepsilon \end{cases}$$

### Complete Implementation

```python
import numpy as np


class EpsilonGreedy:
    """
    ε-greedy bandit for multi-armed bandit (no context).

    Simple baseline: explores randomly with probability ε,
    exploits best action with probability 1-ε.
    """

    def __init__(self, n_actions, epsilon=0.1):
        """
        Args:
            n_actions: Number of available actions
            epsilon: Exploration probability (0 = pure exploit, 1 = pure explore)
        """
        self.n_actions = n_actions
        self.epsilon = epsilon

        # Track statistics for each action
        self.counts = np.zeros(n_actions)  # Number of times each action tried
        self.values = np.zeros(n_actions)  # Estimated mean reward for each action

    def select_action(self, context=None):
        """
        Select action using ε-greedy strategy.

        Args:
            context: Ignored for basic ε-greedy (non-contextual)

        Returns:
            action: Index of chosen action (0 to n_actions-1)
        """
        if np.random.random() < self.epsilon:
            # Explore: choose random action
            return np.random.randint(self.n_actions)
        else:
            # Exploit: choose action with highest estimated value
            return np.argmax(self.values)

    def update(self, action, reward):
        """
        Update estimates based on observed reward.

        Args:
            action: Action that was taken
            reward: Observed reward
        """
        self.counts[action] += 1
        n = self.counts[action]

        # Incremental mean update: μ_new = μ_old + (reward - μ_old) / n
        self.values[action] += (reward - self.values[action]) / n

    def get_state(self):
        """Return current state for logging/debugging."""
        return {
            'counts': self.counts.copy(),
            'values': self.values.copy(),
            'epsilon': self.epsilon
        }


# Example usage
if __name__ == "__main__":
    # Simulate 3 actions with different true rewards
    true_rewards = [0.3, 0.5, 0.7]  # Action 2 is best
    n_rounds = 1000

    bandit = EpsilonGreedy(n_actions=3, epsilon=0.1)
    total_reward = 0

    for t in range(n_rounds):
        # Select action
        action = bandit.select_action()

        # Simulate reward (Bernoulli with true probability)
        reward = 1 if np.random.random() < true_rewards[action] else 0

        # Update bandit
        bandit.update(action, reward)
        total_reward += reward

    print(f"Total reward: {total_reward}/{n_rounds}")
    print(f"Average reward: {total_reward/n_rounds:.3f}")
    print(f"Optimal average: {max(true_rewards):.3f}")
    print(f"\nLearned values: {bandit.values}")
    print(f"Action counts: {bandit.counts}")
```

### Decaying Exploration

Fixed ε wastes exploration forever. Better to decay over time:

```python
class DecayingEpsilonGreedy(EpsilonGreedy):
    """
    ε-greedy with decaying exploration rate.

    Starts with high exploration, gradually increases exploitation.
""" def __init__(self, n_actions, epsilon_0=1.0, decay_rate=0.01, min_epsilon=0.01): """ Args: epsilon_0: Initial exploration probability decay_rate: How fast to decay (higher = faster decay) min_epsilon: Minimum exploration probability """ super().__init__(n_actions, epsilon=epsilon_0) self.epsilon_0 = epsilon_0 self.decay_rate = decay_rate self.min_epsilon = min_epsilon self.t = 0 def select_action(self, context=None): """Select action with decaying ε.""" # Update epsilon: ε(t) = max(min_ε, ε₀ / (1 + decay_rate * t)) self.t += 1 self.epsilon = max( self.min_epsilon, self.epsilon_0 / (1 + self.decay_rate * self.t) ) return super().select_action(context) ``` ### When to Use ε-greedy **✅ Use when:** - You need a simple, interpretable baseline - Explaining the algorithm to stakeholders - Testing your infrastructure before deploying complex methods - Environment is non-stationary (constant ε helps track drift) - Many actions (simpler than maintaining confidence bounds for thousands of actions) **❌ Limitations:** - Explores uniformly (wastes trials on clearly bad actions) - Suboptimal regret: O(T^(2/3)) vs O(√T) for UCB/Thompson - Requires tuning ε (too high wastes traffic, too low risks convergence) ### Hyperparameter Tuning | Parameter | Typical Range | Too Low → Problem | Too High → Problem | |-----------|---------------|-------------------|-------------------| | **ε (exploration)** | 0.05 - 0.2 | Premature convergence | Wasted traffic on bad actions | | **decay_rate** | 0.001 - 0.1 | Too slow adaptation | Stops exploring too quickly | | **min_epsilon** | 0.01 - 0.05 | Eventually stops learning | Never fully exploits | **Rule of thumb:** - **Stationary environment:** Start ε = 0.1, decay to min = 0.01 - **Non-stationary:** Use constant ε = 0.15 (no decay) - **Quick testing:** ε = 0.2 (more exploration, faster learning signal) --- ## UCB: Confidence-Based Exploration UCB (Upper Confidence Bound) improves on ε-greedy by directing exploration toward **uncertain** actions, not random ones. ### Algorithm (UCB1 for MAB) ```python At each round t: 1. Compute UCB for each action: UCB(a) = μ̂_a + sqrt(2 log t / n_a) 2. Choose action with highest UCB 3. Observe reward, update estimates ``` ### Implementation ```python class UCB1: """ UCB1 algorithm for multi-armed bandits. Uses optimism under uncertainty: adds confidence bonus to each action's estimate, naturally balancing exploration and exploitation. """ def __init__(self, n_actions): """ Args: n_actions: Number of available actions """ self.n_actions = n_actions self.counts = np.zeros(n_actions) self.values = np.zeros(n_actions) self.t = 0 # Total rounds def select_action(self, context=None): """ Select action using UCB strategy. 

        Returns:
            action: Index of action with highest UCB
        """
        self.t += 1

        # Phase 1: Try each action once (initialization)
        for a in range(self.n_actions):
            if self.counts[a] == 0:
                return a

        # Phase 2: Compute UCB for each action
        ucb_values = np.zeros(self.n_actions)
        for a in range(self.n_actions):
            # UCB = estimate + confidence bonus
            confidence_bonus = np.sqrt(2 * np.log(self.t) / self.counts[a])
            ucb_values[a] = self.values[a] + confidence_bonus

        return np.argmax(ucb_values)

    def update(self, action, reward):
        """Update estimates based on observed reward."""
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n

    def get_ucb_values(self):
        """Return current UCB values for debugging."""
        if self.t == 0:
            return np.zeros(self.n_actions)

        ucb_values = np.zeros(self.n_actions)
        for a in range(self.n_actions):
            if self.counts[a] > 0:
                confidence_bonus = np.sqrt(2 * np.log(self.t) / self.counts[a])
                ucb_values[a] = self.values[a] + confidence_bonus
        return ucb_values
```

### Visual Intuition

Remember from Part 2: UCB gives higher bonuses to under-explored actions.

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph LR
    subgraph Action_A[" "]
        A_Est[Estimate: 0.5<br/>Trials: 100<br/>Bonus: +0.1]
        A_UCB[UCB: 0.6<br/>Well-explored]
    end

    subgraph Action_B[" "]
        B_Est[Estimate: 0.4<br/>Trials: 10<br/>Bonus: +0.3]
        B_UCB[UCB: 0.7<br/>⚡ Selected]
    end

    A_Est --> A_UCB
    B_Est --> B_UCB
    B_UCB -.chosen.-> Winner[Action B wins<br/>despite lower estimate]

    style Action_A fill:none,stroke:none
    style Action_B fill:none,stroke:none
    style A_Est fill:#334155,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style A_UCB fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style B_Est fill:#334155,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style B_UCB fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px
    style Winner fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2px
    linkStyle 2 stroke:#06b6d4,stroke-width:2px,stroke-dasharray:5
```

### When to Use UCB1

**✅ Use when:**

- Stochastic rewards (i.i.d. from fixed distributions)
- Moderate number of actions (K = 2-100)
- Want theoretical guarantees (provable regret bounds)
- No context available (pure MAB problem)

**❌ Limitations:**

- No context (doesn't use side information)
- Assumes stationarity (struggles when distributions drift)
- Can over-explore compared to Thompson Sampling

---

## LinUCB: UCB for Contextual Bandits

LinUCB extends UCB to contextual bandits with linear reward models. This is the **production workhorse** for most applications.

### Algorithm

```python
At each round t:
1. Observe context x_t
2. For each action a:
   - Estimate θ̂_a via ridge regression
   - Compute confidence radius: σ_a(x) = α · sqrt(x^T A_a^-1 x)
   - Compute UCB: UCB_a(x) = θ̂_a^T x + σ_a(x)
3. Choose a_t = argmax_a UCB_a(x_t)
4. Observe reward, update estimates
```

### Complete Implementation

```python
class LinUCB:
    """
    Linear Upper Confidence Bound algorithm.

    Assumes linear reward model: r(x,a) = θ_a^T x + ε
    Uses ridge regression to estimate parameters and confidence
    ellipsoids to guide exploration.
""" def __init__(self, n_actions, n_features, alpha=1.0, lambda_=1.0): """ Args: n_actions: Number of available actions n_features: Dimensionality of context features alpha: Exploration parameter (higher = more exploration) lambda_: Regularization parameter (higher = more regularization) """ self.n_actions = n_actions self.n_features = n_features self.alpha = alpha self.lambda_ = lambda_ # Initialize for each action # A_a = X_a^T X_a + λI (regularized Gram matrix) # b_a = X_a^T r_a (feature-reward products) self.A = [lambda_ * np.eye(n_features) for _ in range(n_actions)] self.b = [np.zeros(n_features) for _ in range(n_actions)] def select_action(self, context): """ Select action with highest UCB. Args: context: Feature vector (array of shape [n_features]) Returns: action: Index of chosen action """ context = np.array(context).reshape(-1, 1) # Column vector ucb_values = np.zeros(self.n_actions) for a in range(self.n_actions): # Compute parameter estimate: θ̂_a = A_a^-1 b_a A_inv = np.linalg.inv(self.A[a]) theta = A_inv @ self.b[a] # Predicted reward predicted_reward = theta.T @ context # Confidence bonus: α * sqrt(x^T A_a^-1 x) confidence_bonus = self.alpha * np.sqrt( context.T @ A_inv @ context ) # UCB = predicted reward + confidence bonus ucb_values[a] = predicted_reward + confidence_bonus return np.argmax(ucb_values) def update(self, action, context, reward): """ Update estimates for chosen action. Args: action: Action that was taken context: Feature vector that was observed reward: Reward that was received """ context = np.array(context).reshape(-1, 1) # Update A_a = A_a + x x^T self.A[action] += context @ context.T # Update b_a = b_a + r x self.b[action] += reward * context.squeeze() def get_theta(self, action): """Get learned parameter vector for an action.""" A_inv = np.linalg.inv(self.A[action]) return A_inv @ self.b[action] def get_confidence_radius(self, action, context): """Get confidence radius for action in given context.""" context = np.array(context).reshape(-1, 1) A_inv = np.linalg.inv(self.A[action]) return self.alpha * np.sqrt(context.T @ A_inv @ context)[0, 0] # Example usage if __name__ == "__main__": # Simulate contextual bandit with 3 actions, 5 features n_actions = 3 n_features = 5 n_rounds = 1000 # True parameters (unknown to algorithm) true_theta = [ np.array([0.1, 0.2, -0.1, 0.3, 0.1]), # Action 0 np.array([0.3, 0.1, 0.2, -0.1, 0.2]), # Action 1 np.array([0.2, 0.3, 0.1, 0.2, 0.3]), # Action 2 ] bandit = LinUCB(n_actions, n_features, alpha=1.0, lambda_=1.0) total_reward = 0 for t in range(n_rounds): # Generate random context context = np.random.randn(n_features) # Select action action = bandit.select_action(context) # Simulate reward: r = θ_a^T x + noise true_reward = true_theta[action] @ context noise = np.random.normal(0, 0.1) reward = true_reward + noise # Update bandit bandit.update(action, context, reward) total_reward += reward print(f"Average reward: {total_reward/n_rounds:.3f}") print("\nLearned parameters vs True parameters:") for a in range(n_actions): learned = bandit.get_theta(a) print(f"Action {a}:") print(f" Learned: {learned}") print(f" True: {true_theta[a]}") print(f" Error: {np.linalg.norm(learned - true_theta[a]):.3f}") ``` ### When to Use LinUCB **✅ Use when:** - Have context features (user, item, situation) - Rewards are approximately linear in features - Need interpretable model (can examine learned θ weights) - Context dimensionality is moderate (d = 10-1000) - Want provable regret guarantees: O(d√(T log T)) **❌ Limitations:** - Assumes linear 
- Computational cost grows with d² (matrix inversion)
- May underperform Thompson Sampling empirically

### Hyperparameter Tuning

| Parameter | Typical Range | Effect | Tuning Strategy |
|-----------|---------------|--------|-----------------|
| **α (exploration)** | 0.1 - 2.0 | Controls exploration intensity | Start 1.0, increase if under-exploring |
| **λ (regularization)** | 0.1 - 10.0 | Prevents overfitting, stabilizes estimates | Increase if high variance or correlated features |

**Quick tuning:**

1. Start with α = 1.0, λ = 1.0 (defaults)
2. If regret grows linearly → increase α (more exploration)
3. If high variance in estimates → increase λ (more regularization)
4. If many features (d > 100) → use λ = 5-10

---

## Thompson Sampling: Bayesian Posterior Sampling

Thompson Sampling takes a Bayesian approach: maintain probability distributions over parameters, sample from posteriors to choose actions.

### Algorithm

```python
At each round t:
1. For each action a:
   - Sample θ̃_a ~ P(θ_a | data)
2. Choose a_t = argmax_a θ̃_a^T x_t (using sampled parameters)
3. Observe reward, update posteriors
```

### Bernoulli Thompson Sampling (for binary rewards)

```python
class BernoulliThompsonSampling:
    """
    Thompson Sampling for Bernoulli bandits (binary rewards).

    Uses Beta-Bernoulli conjugacy for efficient posterior updates.
    Natural exploration-exploitation balance through posterior sampling.
    """

    def __init__(self, n_actions):
        """
        Args:
            n_actions: Number of available actions
        """
        self.n_actions = n_actions

        # Beta prior parameters: Beta(α, β)
        # Start with uniform prior: Beta(1, 1)
        self.alpha = np.ones(n_actions)  # Successes + 1
        self.beta = np.ones(n_actions)   # Failures + 1

    def select_action(self, context=None):
        """
        Sample from posterior and choose action with highest sample.

        Returns:
            action: Index of chosen action
        """
        # Sample θ̃_a ~ Beta(α_a, β_a) for each action
        samples = np.random.beta(self.alpha, self.beta)

        # Choose action with highest sampled value
        return np.argmax(samples)

    def update(self, action, reward):
        """
        Update posterior based on observed reward.

        Args:
            action: Action that was taken
            reward: Binary reward (0 or 1)
        """
        # Bayesian update for Beta-Bernoulli
        if reward == 1:
            self.alpha[action] += 1  # Observed success
        else:
            self.beta[action] += 1   # Observed failure

    def get_posterior_stats(self):
        """Return posterior means and standard deviations."""
        means = self.alpha / (self.alpha + self.beta)
        variances = (self.alpha * self.beta) / (
            (self.alpha + self.beta)**2 * (self.alpha + self.beta + 1)
        )
        return means, np.sqrt(variances)
```

### Gaussian Thompson Sampling (for continuous rewards)

```python
class GaussianThompsonSampling:
    """
    Thompson Sampling for Gaussian bandits (continuous rewards).

    Assumes rewards are normally distributed. Uses Bayesian updating
    for Gaussian with known variance.
    """

    def __init__(self, n_actions, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        """
        Args:
            n_actions: Number of available actions
            prior_mean: Prior mean for each action's reward
            prior_var: Prior variance for each action's reward
            noise_var: Known noise variance in rewards
        """
        self.n_actions = n_actions
        self.noise_var = noise_var

        # Posterior parameters (start with prior)
        self.means = np.ones(n_actions) * prior_mean
        self.variances = np.ones(n_actions) * prior_var
        self.counts = np.zeros(n_actions)

    def select_action(self, context=None):
        """
        Sample from Gaussian posterior for each action.

        Returns:
            action: Index of action with highest sample
        """
        # Sample θ̃_a ~ N(μ_a, σ²_a) for each action
        samples = np.random.normal(self.means, np.sqrt(self.variances))
        return np.argmax(samples)

    def update(self, action, reward):
        """
        Bayesian update for Gaussian with known variance.

        Args:
            action: Action that was taken
            reward: Observed continuous reward
        """
        # Bayesian update using precision (inverse variance)
        precision_prior = 1 / self.variances[action]
        precision_data = 1 / self.noise_var

        # Posterior precision = prior precision + data precision
        posterior_precision = precision_prior + precision_data

        # Posterior mean is weighted average
        posterior_mean = (
            (precision_prior * self.means[action] + precision_data * reward)
            / posterior_precision
        )

        # Update
        self.means[action] = posterior_mean
        self.variances[action] = 1 / posterior_precision
        self.counts[action] += 1
```

### Linear Thompson Sampling

```python
class LinearThompsonSampling:
    """
    Thompson Sampling for linear contextual bandits.

    Assumes linear reward model with Gaussian noise. Samples parameters
    from posterior, chooses action with highest predicted reward under
    sampled parameters.
    """

    def __init__(self, n_actions, n_features, lambda_=1.0, noise_var=1.0):
        """
        Args:
            n_actions: Number of available actions
            n_features: Dimensionality of context features
            lambda_: Regularization parameter (prior precision)
            noise_var: Noise variance in rewards
        """
        self.n_actions = n_actions
        self.n_features = n_features
        self.lambda_ = lambda_
        self.noise_var = noise_var

        # Sufficient statistics for each action
        # Same as LinUCB but interpret as posterior parameters
        self.A = [lambda_ * np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        """
        Sample θ from posterior, choose action with highest θ^T x.

        Args:
            context: Feature vector

        Returns:
            action: Index of chosen action
        """
        context = np.array(context).reshape(-1, 1)
        sampled_rewards = np.zeros(self.n_actions)

        for a in range(self.n_actions):
            # Posterior distribution: θ_a | data ~ N(μ_a, Σ_a)
            A_inv = np.linalg.inv(self.A[a])
            theta_mean = A_inv @ self.b[a]
            theta_cov = self.noise_var * A_inv

            # Sample θ̃_a ~ N(μ_a, Σ_a)
            theta_sample = np.random.multivariate_normal(theta_mean, theta_cov)

            # Compute predicted reward: r̃ = θ̃_a^T x
            sampled_rewards[a] = theta_sample.T @ context.squeeze()

        return np.argmax(sampled_rewards)

    def update(self, action, context, reward):
        """
        Update posterior for chosen action.

        Args:
            action: Action that was taken
            context: Feature vector
            reward: Observed reward
        """
        context = np.array(context).reshape(-1, 1)

        # Same update as LinUCB (Bayesian interpretation)
        self.A[action] += context @ context.T
        self.b[action] += reward * context.squeeze()

    def get_posterior_params(self, action):
        """Get posterior mean and covariance for an action."""
        A_inv = np.linalg.inv(self.A[action])
        mean = A_inv @ self.b[action]
        cov = self.noise_var * A_inv
        return mean, cov
```
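
To exercise the class end to end, here is a minimal usage sketch mirroring the LinUCB example above; the simulated `true_theta` parameters and the 0.1 noise level are illustrative, not prescriptive.

```python
# Minimal usage sketch for LinearThompsonSampling, mirroring the LinUCB example.
# The simulated true_theta values and noise level are illustrative assumptions.
import numpy as np

n_actions, n_features, n_rounds = 3, 5, 1000
true_theta = [np.random.randn(n_features) for _ in range(n_actions)]  # unknown to the agent

bandit = LinearThompsonSampling(n_actions, n_features, lambda_=1.0, noise_var=1.0)
total_reward = 0.0

for t in range(n_rounds):
    context = np.random.randn(n_features)
    action = bandit.select_action(context)

    # Linear reward plus Gaussian noise
    reward = true_theta[action] @ context + np.random.normal(0, 0.1)

    bandit.update(action, context, reward)
    total_reward += reward

print(f"Average reward: {total_reward / n_rounds:.3f}")
```
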
### When to Use Thompson Sampling

**✅ Use when:**

- Want best empirical performance (often beats UCB)
- Don't want to tune exploration parameters (self-tuning)
- Can afford computational cost of sampling
- Environment may be non-stationary (posterior naturally adapts)

**❌ Limitations:**

- More computationally expensive (sampling from posteriors)
- Less interpretable than UCB confidence bounds
- Theoretical guarantees weaker (though still optimal)

### Why Thompson Sampling Works So Well

**Probability matching:** The probability of choosing action a equals the probability that a is optimal.

**Natural exploration schedule:**

- Early: Posteriors are diffuse → high variance in samples → exploration
- Late: Posteriors concentrate → low variance → exploitation

No manual decay schedule needed. The Bayesian framework handles it automatically.

---

## Algorithm Comparison

### Performance Comparison

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph TB
    subgraph Comparison["Algorithm Performance Spectrum"]
        direction LR
        EG[ε-greedy<br/>━━━━━━━━<br/>Regret: O(T^2/3)<br/>Simple, suboptimal]
        UCB[UCB / LinUCB<br/>━━━━━━━━<br/>Regret: O(√T log T)<br/>Provable, interpretable]
        TS[Thompson Sampling<br/>━━━━━━━━<br/>Regret: O(√T)<br/>Optimal, self-tuning]
    end

    EG -.worse.-> UCB
    UCB -.comparable.-> TS

    style EG fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style UCB fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style TS fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px
    style Comparison fill:none,stroke:#64748b,stroke-width:2px
    linkStyle 0 stroke:#ef4444,stroke-width:2px,stroke-dasharray:3
    linkStyle 1 stroke:#10b981,stroke-width:2px,stroke-dasharray:3
```

### Detailed Comparison Table

| Algorithm | Regret | Exploration | Tuning | Computational Cost | Use When |
|-----------|--------|-------------|--------|--------------------|----------|
| **ε-greedy** | O(T^(2/3)) | Random | Tune ε | O(1) per round | Simple baseline, non-stationary |
| **UCB1** | O(√(KT log T)) | Confidence-based | None | O(K) per round | MAB, theoretical guarantees |
| **LinUCB** | O(d√(T log T)) | Confidence ellipsoid | Tune α, λ | O(d²) per round | Linear rewards, interpretability |
| **Thompson (Bernoulli)** | O(√T) | Posterior sampling | None | O(K) per round | Binary rewards, best empirical |
| **Thompson (Linear)** | O(√T) | Posterior sampling | Tune λ | O(d³) per round | Linear rewards, self-tuning |

### Practical Recommendation

**For most production use cases:**

1. **Start:** ε-greedy with ε = 0.1 (validate infrastructure)
2. **Upgrade:** LinUCB with α = 1.0, λ = 1.0 (production baseline)
3. **Optimize:** Linear Thompson Sampling (often best performance)

**Thompson Sampling often wins empirically** despite similar theoretical bounds to UCB. The self-tuning exploration is powerful in practice.
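
The probability matching described above can be checked directly: the rate at which Thompson Sampling plays each arm should track the posterior probability that the arm is optimal. Here is a small sketch using the `BernoulliThompsonSampling` class from earlier; the success rates, warm-up length, and sample counts are made up for illustration.

```python
import numpy as np

# Illustrative check of probability matching with the BernoulliThompsonSampling
# class defined above; the true success rates below are arbitrary.
np.random.seed(0)
true_p = [0.45, 0.50, 0.55]
ts = BernoulliThompsonSampling(n_actions=3)

# Warm up the posteriors with some interaction
for _ in range(500):
    a = ts.select_action()
    ts.update(a, int(np.random.random() < true_p[a]))

# Monte Carlo estimate of P(arm is optimal) under the current posteriors
draws = np.random.beta(ts.alpha, ts.beta, size=(10_000, 3))
p_optimal = np.bincount(draws.argmax(axis=1), minlength=3) / 10_000

# Selection frequency induced by those same posteriors (no updates here)
picks = np.bincount(
    [ts.select_action() for _ in range(10_000)], minlength=3
) / 10_000

print("P(arm is optimal):  ", np.round(p_optimal, 3))
print("Selection frequency:", np.round(picks, 3))  # the two should roughly agree
```

Because `select_action` only samples from the posteriors, repeated calls without updates expose the selection distribution those posteriors induce.
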
---

## Hyperparameter Tuning Guide

### General Tuning Strategy

```python
def tune_hyperparameters(algorithm_class, param_grid, data, metric='reward'):
    """
    Simple hyperparameter tuning via grid search.

    Args:
        algorithm_class: Bandit algorithm class
        param_grid: Dict of parameter ranges to try
        data: Historical data (context, action, reward) tuples
        metric: What to optimize ('reward', 'regret', etc.)

    Returns:
        best_params: Best hyperparameter configuration
        results: Performance for all configurations
    """
    from itertools import product

    # Generate all combinations
    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())
    combinations = list(product(*param_values))

    results = []

    for combo in combinations:
        params = dict(zip(param_names, combo))

        # Initialize algorithm with these params
        bandit = algorithm_class(**params)

        # Replay historical data
        total_reward = 0
        for context, action, reward in data:
            # Only update if this is what bandit would have chosen
            chosen_action = bandit.select_action(context)
            if chosen_action == action:
                bandit.update(action, context, reward)
                total_reward += reward

        results.append({
            'params': params,
            'reward': total_reward,
            'coverage': sum(
                1 for c, a, r in data if bandit.select_action(c) == a
            ) / len(data)
        })

    # Find best
    best = max(results, key=lambda x: x[metric])
    return best['params'], results


# Example usage
param_grid = {
    'n_actions': [10],    # Fixed
    'n_features': [20],   # Fixed
    'alpha': [0.5, 1.0, 2.0],
    'lambda_': [0.1, 1.0, 10.0]
}

# best_params, all_results = tune_hyperparameters(LinUCB, param_grid, historical_data)
```

### Parameter Guidelines

| Algorithm | Parameter | Default | When to Increase | When to Decrease |
|-----------|-----------|---------|------------------|------------------|
| **ε-greedy** | ε | 0.1 | Regret growing linearly | Too many suboptimal actions |
| **LinUCB** | α | 1.0 | Under-exploration (linear regret) | Over-exploration (slow convergence) |
| **LinUCB** | λ | 1.0 | High variance, correlated features | Underfitting, slow learning |
| **Thompson** | λ | 1.0 | High variance | Underfitting |
| **Thompson** | noise_var | 1.0 | Rewards more noisy | Rewards less noisy |

### Quick Diagnostic Checklist

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| **Regret grows linearly** | Under-exploration | Increase ε or α |
| **High variance in estimates** | Need more regularization | Increase λ |
| **Slow convergence** | Over-exploration | Decrease ε or α |
| **All traffic on one action** | Premature convergence | Increase exploration |
| **Random action selection** | Over-exploration | Decrease ε |

---

## Complete Working Example

Here's a full example comparing all algorithms on the same problem:

```python
import numpy as np
import matplotlib.pyplot as plt


# Simulate contextual bandit environment
class SimulatedEnvironment:
    """Simulated environment for testing bandit algorithms."""

    def __init__(self, n_actions=3, n_features=5):
        self.n_actions = n_actions
        self.n_features = n_features

        # True parameters (unknown to algorithms)
        self.true_theta = [
            np.random.randn(n_features) for _ in range(n_actions)
        ]

        # Compute optimal expected reward for regret calculation
        self.optimal_expected_reward = None

    def get_context(self):
        """Generate random context."""
        return np.random.randn(self.n_features)

    def get_reward(self, context, action):
        """Generate reward for context-action pair."""
        true_reward = self.true_theta[action] @ context
        noise = np.random.normal(0, 0.1)
        return true_reward + noise

    def get_optimal_action(self, context):
        """Return optimal action for context (for regret calculation)."""
        expected_rewards = [theta @ context for theta in self.true_theta]
        return np.argmax(expected_rewards)


def run_experiment(bandit, env, n_rounds=1000):
    """
    Run bandit algorithm on environment.

    Returns:
        rewards: List of rewards per round
        regrets: List of regret per round
        actions: List of actions chosen per round
    """
    rewards = []
    regrets = []
    actions_chosen = []

    for t in range(n_rounds):
        # Get context
        context = env.get_context()

        # Bandit chooses action
        action = bandit.select_action(context)

        # Observe reward
        reward = env.get_reward(context, action)

        # Compute regret (noisy estimate: compares sampled rewards, not expectations)
        optimal_action = env.get_optimal_action(context)
        optimal_reward = env.get_reward(context, optimal_action)
        regret = optimal_reward - reward

        # Update bandit (contextual algorithms take the context, MAB baselines don't)
        try:
            bandit.update(action, context, reward)
        except TypeError:
            bandit.update(action, reward)

        # Log
        rewards.append(reward)
        regrets.append(regret)
        actions_chosen.append(action)

    return rewards, regrets, actions_chosen


# Run comparison
if __name__ == "__main__":
    n_rounds = 1000
    n_actions = 3
    n_features = 5

    # Create environment
    env = SimulatedEnvironment(n_actions, n_features)

    # Test algorithms
    algorithms = {
        'ε-greedy (ε=0.1)': EpsilonGreedy(n_actions, epsilon=0.1),
        'LinUCB (α=1.0)': LinUCB(n_actions, n_features, alpha=1.0),
        'Linear Thompson': LinearThompsonSampling(n_actions, n_features)
    }

    results = {}
    for name, bandit in algorithms.items():
        rewards, regrets, actions = run_experiment(bandit, env, n_rounds)
        results[name] = {
            'rewards': rewards,
            'regrets': regrets,
            'cumulative_regret': np.cumsum(regrets)
        }
        print(f"\n{name}:")
        print(f"  Average reward: {np.mean(rewards):.3f}")
        print(f"  Cumulative regret: {np.sum(regrets):.1f}")

    # Plot cumulative regret
    plt.figure(figsize=(10, 6))
    for name, data in results.items():
        plt.plot(data['cumulative_regret'], label=name, linewidth=2)
    plt.xlabel('Round')
    plt.ylabel('Cumulative Regret')
    plt.title('Algorithm Comparison: Cumulative Regret Over Time')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('bandit_comparison.png', dpi=150)
    print("\nPlot saved as 'bandit_comparison.png'")
```

---

## Key Takeaways

**Essential concepts:**

**Start simple, upgrade deliberately**

- Begin with ε-greedy to validate infrastructure
- Upgrade to LinUCB for a production baseline
- Use Thompson Sampling for best empirical performance

**Algorithm selection depends on your problem:**

- No context → ε-greedy or UCB1
- Linear context (d < 500) → LinUCB or Linear Thompson
- Need interpretability → LinUCB (examine θ weights)
- Maximum performance → Thompson Sampling

**Default hyperparameters work well:**

- ε-greedy: ε = 0.1 with decay 1/t
- LinUCB: α = 1.0, λ = 1.0
- Thompson: Use defaults, tune only if needed

**Tuning based on symptoms:**

- Linear regret → Increase exploration (ε or α)
- High variance → Increase regularization (λ)
- Slow convergence → Decrease exploration

**Thompson Sampling often wins in practice**

- Self-tuning exploration (no manual decay)
- Best empirical results in many domains
- Slight computational overhead worth it

**Practical workflow:**

1. **Week 1:** Implement ε-greedy, validate logging
2. **Week 2:** Upgrade to LinUCB or Thompson
3. **Week 3:** Tune hyperparameters on historical data
4. **Week 4:** Deploy to production with monitoring

---

## Further Reading

**Code repositories:**

- **[Vowpal Wabbit](https://github.com/VowpalWabbit/vowpal_wabbit)**: Production-grade bandit library from Microsoft
- **[Contextualbandits](https://github.com/david-cortes/contextualbandits)**: Python library with scikit-learn integration
- **[PyMC Bandits](https://github.com/pymc-devs/pymc-experimental)**: Bayesian bandit implementations

**Papers on these algorithms:**

- **ε-greedy:** Classic exploration strategy, no single paper
- **UCB1:** [Auer et al., 2002](https://link.springer.com/article/10.1023/A:1013689704352)
- **LinUCB:** [Li et al., 2010](https://arxiv.org/abs/1003.0146)
- **Thompson Sampling:** [Agrawal & Goyal, 2013](http://proceedings.mlr.press/v28/agrawal13.html)

**Blog posts:**

- [Lil'Log: Multi-Armed Bandits](https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/)
- [Jeremy Kun: Bandits Series](https://jeremykun.com/tag/multi-armed-bandit/)

---

Article series

Adaptive Optimization at Scale: Contextual Bandits from Theory to Production

Part 3 of 5

  1. Part 1 When to Use Contextual Bandits: The Decision Framework
  2. Part 2 Contextual Bandit Theory: Regret Bounds and Exploration
  3. Part 3 Implementing Contextual Bandits: Complete Algorithm Guide
  4. Part 4 Neural Contextual Bandits for High-Dimensional Data
  5. Part 5 Deploying Contextual Bandits: Production Guide and Offline Evaluation
