<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Yash's Blog]]></title><description><![CDATA[Yash's Blog]]></description><link>https://blogs.yashpatel.xyz</link><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 10:49:52 GMT</lastBuildDate><atom:link href="https://blogs.yashpatel.xyz/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mixtral of Experts: Top-2 Routing Gives 47B Capacity at 13B Active Compute]]></title><description><![CDATA[At roughly 2.08e11 cumulative FLOPs in my own run, a dense baseline lands at 25.31 validation perplexity. At nearly the same compute budget, a sparse MoE lands at 20.98. The absolute delta is 4.324, a]]></description><link>https://blogs.yashpatel.xyz/mixtral-of-experts-top-2-routing-gives-47b-capacity-at-13b-active-compute</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/mixtral-of-experts-top-2-routing-gives-47b-capacity-at-13b-active-compute</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/085563ad-80ec-49be-a499-0c32e95ed803.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At roughly 2.08e11 cumulative FLOPs in my own run, a dense baseline lands at 25.31 validation perplexity. At nearly the same compute budget, a sparse MoE lands at 20.98. The absolute delta is 4.324, a 17.09% reduction in validation perplexity at almost the same training compute.</p>
<p>Before sparse MoE layers became practical, the scaling path was blunt. If I wanted more capability, I paid for a bigger dense feedforward stack on every token, every layer, every step. Training cost rose. Inference cost rose. Memory pressure rose. Production teams got trapped in a false binary: either ship a smaller model that misses quality targets, or ship a larger one with painful latency and infrastructure bills.</p>
<p>Mixtral addresses that exact bottleneck. Instead of running one dense FFN per token, each FFN block is replaced by a router plus multiple experts. The router picks only two experts for each token state. The model keeps large total parameter capacity while keeping per-token active compute bounded.</p>
<p>That is the key pre-paper pain and paper-level claim in one line: dense scaling ties quality to always-on compute, while sparse routing partially decouples them.</p>
<h2>The Paper's Claim</h2>
<p>The paper frames dense transformer FFNs as the scaling bottleneck. Dense layers activate all FFN parameters for every token, so quality gains come with directly higher per-token compute. Mixtral's proposal is to replace dense FFNs with sparse MoE FFNs: 8 experts per layer, top-2 active experts per token, weighted recombination through router probabilities.</p>
<p>The central claim is not only architectural novelty. It is quality-per-compute at useful scale. The authors report Mixtral 8x7B outperforming Llama 2 70B on benchmarks such as MMLU, HellaSwag, and GSM8K, while approaching GPT-3.5-level results on many public evaluations. They pair that quality claim with an efficiency claim: 12.9B active parameters per token, not 46.7B, and materially faster inference than comparable dense 70B-class models.</p>
<h2>Mechanism</h2>
<p>Mixtral is still a decoder-only Transformer. The QKV path is unchanged from my attention walkthrough in <a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">attention-is-all-you-need</a>. The architectural shift is local but deep: the feedforward block in each Transformer layer is replaced by a sparse mixture of experts block.</p>
<p>At system level, one token at layer 𝑙 does this:</p>
<ol>
<li><p>Runs attention as usual.</p>
</li>
<li><p>Enters a router.</p>
</li>
<li><p>Router picks top-2 experts out of 8.</p>
</li>
<li><p>Token is processed by only those experts.</p>
</li>
<li><p>Expert outputs are weighted and added.</p>
</li>
</ol>
<h3>Component Breakdown</h3>
<p>For each MoE layer:</p>
<ul>
<li><p>Router: one linear map from hidden state to 8 logits.</p>
</li>
<li><p>Top-k selection: keep two highest-logit experts.</p>
</li>
<li><p>Expert FFNs: in Mixtral-style implementations, each expert is a SwiGLU MLP.</p>
</li>
<li><p>Aggregation: weighted sum of selected expert outputs.</p>
</li>
</ul>
<p>In my artifact code, this lives in <code>moe_core.py</code> with <code>MoEFeedForward</code> and <code>SwiGLUExpert</code>. The dense baseline uses a SwiGLU FFN too, so dense and MoE FFN compute are compared within the same FFN family.</p>
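<p>A minimal PyTorch sketch of such a block, looping over experts for readability (my artifact's real <code>MoEFeedForward</code> uses batched dispatch; the widths here match the smoke run described below but are otherwise illustrative):</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU MLP (three linear maps, no bias)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoEFeedForward(nn.Module):
    """Router + 8 experts, top-2 active per token (loop form for clarity)."""
    def __init__(self, d_model=192, d_ff_expert=96, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(d_model, d_ff_expert) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)    # renormalise over the top-2 only
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():                 # run each expert only on its tokens
                    y[mask] += gates[mask, slot:slot+1] * expert(x[mask])
        return y
</code></pre>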
<h3>Worked Token Example</h3>
<p>Take a token state vector \(x_t \in \mathbb{R}^{d_{\text{model}}}\) for the word "latency". Assume the router emits logits over 8 experts:</p>
<p>$$z = [2.9, 0.3, 1.7, -0.1, 2.2, 0.4, -1.2, 0.1]$$</p>
<p>Top-2 indices are experts 0 and 4 (logits 2.9 and 2.2). I softmax only over those two values:</p>
<p>$$p = \text{softmax}([2.9, 2.2]) = [0.668, 0.332]$$</p>
<p>Then only two expert MLPs run:</p>
<p>$$y_t = 0.668 \cdot E_0(x_t) + 0.332 \cdot E_4(x_t)$$</p>
<p>Experts 1,2,3,5,6,7 are skipped for this token at this layer.</p>
<p>Next token can pick a different pair. Next layer can pick another pair again. That dynamic token-wise specialization is where the extra capacity comes from.</p>
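<p>The whole worked example fits in a few lines, assuming (as above) that gate weights come from a softmax over only the two kept logits:</p>
<pre><code class="language-python">import torch

z = torch.tensor([2.9, 0.3, 1.7, -0.1, 2.2, 0.4, -1.2, 0.1])
vals, idx = z.topk(2)                 # top-2 logits and their expert IDs
p = torch.softmax(vals, dim=0)        # renormalise over the kept pair only
print(idx.tolist())                   # [0, 4]
print([round(v, 3) for v in p.tolist()])  # [0.668, 0.332]
</code></pre>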
<h3>Step-by-Step Computation Path</h3>
<p>I break one forward pass into the exact sequence that matters for performance and stability:</p>
<ol>
<li><p>Router projection computes 8 logits from one token hidden state.</p>
</li>
<li><p>Top-2 selection keeps expert IDs and gating weights.</p>
</li>
<li><p>Dispatch packs token activations by expert ID.</p>
</li>
<li><p>Each selected expert runs its own SwiGLU MLP on its assigned token slice.</p>
</li>
<li><p>Gather unpacks outputs to original token order.</p>
</li>
<li><p>Weighted combine applies gate weights and sums the two expert outputs.</p>
</li>
<li><p>Residual path adds the MoE output back to the layer stream.</p>
</li>
</ol>
<p>Each step is simple in isolation. The complexity appears in step 3 and step 5 at scale, where token-expert imbalance directly affects kernel efficiency and tail latency.</p>
<p>During training, backpropagation flows through the selected experts and the router probabilities that produced those selections. During inference, the same dispatch path runs without gradient tracking, but the same load patterns still decide latency behavior.</p>
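<p>A sketch of steps 3 through 6, under one simplifying assumption: <code>experts</code> is a plain list of expert modules, and packing is done by sorting the flattened (token, expert) assignments by expert ID so each expert runs exactly one batched call:</p>
<pre><code class="language-python">import torch

def dispatch_gather(x, top_idx, gates, experts):
    """Steps 3-6: pack tokens by expert ID, run each expert once on a
    contiguous slice, then unpack and gate-combine. top_idx and gates
    are (tokens, k) outputs of the router."""
    tokens, k = top_idx.shape
    flat_expert = top_idx.reshape(-1)            # (tokens*k,)
    flat_token = torch.arange(tokens).repeat_interleave(k)
    order = torch.argsort(flat_expert)           # dispatch: group slots by expert
    counts = torch.bincount(flat_expert, minlength=len(experts))
    y = torch.zeros_like(x)
    start = 0
    for e, expert in enumerate(experts):
        n = counts[e].item()
        if n == 0:
            continue                             # imbalance shows up right here
        sel = order[start:start + n]             # flat slots routed to expert e
        start += n
        tok = flat_token[sel]
        out = expert(x[tok])                     # one batched call per expert
        w = gates.reshape(-1)[sel].unsqueeze(-1)
        y.index_add_(0, tok, w * out)            # gather: scatter back, weighted sum
    return y
</code></pre>
<p>With identity experts and gate weights summing to one per token, the output must equal the input; that makes the pack/unpack bookkeeping easy to unit-test.</p>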
<h3>Core Equations</h3>
<p>The routing objective has two parts: sparse dispatch and balanced utilization.</p>
<p>First, sparse dispatch:</p>
<p>$$y_t = \sum_{i \in \text{TopK}(x_t)} p_i(x_t) E_i(x_t), \quad K=2$$</p>
<p>Why this form makes sense: I want conditional compute, but I still need differentiable weighted composition among active experts.</p>
<p>Second, load balancing (Switch-style top-1 auxiliary):</p>
<p>$$L_{aux} = N \sum_{i=1}^{N} f_i p_i$$</p>
<p>where \(N\) is number of experts, \(f_i\) is fraction of tokens routed to expert \(i\) by top-1 assignment, and \(p_i\) is mean router probability mass for expert \(i\).</p>
<p>Why this form makes sense: if routing collapses to a few experts, \(f_i\) and \(p_i\) become skewed; minimizing this term pushes traffic and confidence toward a more uniform spread.</p>
<p>This exact equation is a Switch-style implementation choice in my artifact, not a Mixtral-specific closed-form requirement from the paper.</p>
<p>In my code, this is <code>auxiliary_load_balancing_loss(...)</code>, and tests verify it against the top-1 formula directly.</p>
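<p>A sketch of that function under the top-1 formula above (the name matches my artifact; the exact signature there may differ):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def auxiliary_load_balancing_loss(router_logits):
    """Switch-style top-1 auxiliary loss: L_aux = N * sum_i f_i * p_i.
    router_logits: (tokens, N). Equals 1.0 under perfectly uniform
    router probabilities; grows toward N as routing collapses."""
    probs = F.softmax(router_logits, dim=-1)         # (tokens, N)
    n_experts = probs.shape[-1]
    top1 = probs.argmax(dim=-1)                      # top-1 assignment
    f = torch.bincount(top1, minlength=n_experts).float() / probs.shape[0]
    p = probs.mean(dim=0)                            # mean router prob mass
    return n_experts * torch.sum(f * p)
</code></pre>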
<h3>FLOP Accounting Intuition</h3>
<p>The most useful engineering question is not "how many total parameters?" It is "how much math did I execute per token?"</p>
<p>Dense FFN forward FLOPs (SwiGLU baseline):</p>
<p>$$\text{FLOPs}_{\text{dense ffn}} = 6\, B\, T\, d_{\text{model}}\, d_{\text{ff}}$$</p>
<p>MoE active FFN forward FLOPs with SwiGLU experts (3 linear projections) and top-2 routing:</p>
<p>$$\text{FLOPs}_{\text{moe active ffn}} = 6\, B\, T\, d_{\text{model}}\, d_{\text{ff,expert}}\, K$$</p>
<p>If \(d_{\text{ff,expert}} = d_{\text{ff}}/8\) and \(K=2\):</p>
<p>$$\frac{\text{FLOPs}_{\text{moe active ffn}}}{\text{FLOPs}_{\text{dense ffn}}} = \frac{6 \cdot (d_{\text{ff}}/8) \cdot 2}{6 \cdot d_{\text{ff}}} = \frac{1}{4}$$</p>
<p>So active MoE FFN math is about 25% of dense FFN math under this setup, before adding router overhead. That is exactly why FLOP-matched races are necessary. Parameter counts alone can mislead teams.</p>
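<p>Plugging in the smoke-run widths confirms the 1/4 ratio and shows the router term is negligible (the <code>2 · d_model · n_experts</code> router cost is my own back-of-envelope estimate, not a number from the paper):</p>
<pre><code class="language-python"># Active-FLOP ratio under the smoke-run widths (d_ff=768, 8 experts -> 96 each).
B, T, d_model = 16, 96, 192          # batch, seq len, width from the setup below
d_ff, n_experts, k = 768, 8, 2
d_ff_expert = d_ff // n_experts      # 96

dense_ffn = 6 * B * T * d_model * d_ff
moe_active_ffn = 6 * B * T * d_model * d_ff_expert * k
router = 2 * B * T * d_model * n_experts   # rough cost of the router matmul

print(moe_active_ffn / dense_ffn)          # 0.25
print(router / dense_ffn)                  # ~0.0035 -> router overhead is tiny
</code></pre>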
<p>What is interesting is where the practical complexity moves. Dense models spend most complexity in matrix sizes. Sparse MoE models spend it in routing stability, token dispatch, and kernel/runtime behavior.</p>
<h3>The Non-Obvious Part</h3>
<p>The hard part is not top-2 math. The hard part is determining whether routing is actually specializing or just staying near-uniform.</p>
<p>In this smoke run, final entropy is 2.0276 (no-aux) and 2.0596 (with-aux), while <strong>ln(8) = 2.0794</strong>. That is 97.51% and 99.04% of maximum entropy, so experts are not strongly specialized yet at this scale. Aux mainly nudges routing slightly flatter rather than creating sharp specialization.</p>
<p>That is why I track three signals together: validation perplexity, top1_share, and entropy. Perplexity alone misses routing behavior. At small scale, the bigger risk signal is early-phase routing noise and mild imbalance, not a confirmed late-stage collapse event.</p>
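<p>A minimal sketch of the two routing signals, computed from a (tokens, N) matrix of router probabilities. These are my working definitions: entropy in nats, top1_share as the busiest expert's fraction of top-1 traffic (uniform routing over 8 experts gives entropy ln(8) ≈ 2.0794 and top1_share 0.125):</p>
<pre><code class="language-python">import torch

def routing_stats(probs):
    """Mean per-token router entropy (nats) and busiest-expert top-1 share
    from a (tokens, N) matrix of router probabilities."""
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    counts = torch.bincount(probs.argmax(-1), minlength=probs.shape[-1])
    top1_share = counts.float().max() / probs.shape[0]
    return entropy.item(), top1_share.item()
</code></pre>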
<h2>Reproduction</h2>
<p>I implemented two runnable artifacts in pure PyTorch:</p>
<ul>
<li><p><code>mixtral_isoflop_race.py</code>: dense vs MoE race at equal cumulative FLOP budget.</p>
</li>
<li><p><code>mixtral_router_balance.py</code>: no-aux vs with-aux routing behavior.</p>
</li>
</ul>
<p>Shared components live in <code>moe_core.py</code>.</p>
<p>Hardware baseline: RTX 3090, 24GB VRAM. The scripts include explicit VRAM checks and CPU fallback.</p>
<h3>Experiment Setup</h3>
<p>For the smoke run used here:</p>
<ul>
<li><p>Sequence length: 96</p>
</li>
<li><p>Batch size: 16</p>
</li>
<li><p>Model width: 192</p>
</li>
<li><p>Layers: 3</p>
</li>
<li><p>Experts: 8, top-k: 2</p>
</li>
<li><p>Dense FFN width: 768, per-expert width: 96</p>
</li>
</ul>
<p>I logged checkpointed metrics to:</p>
<ul>
<li><p><code>outputs/isoflop_race_metrics.csv</code></p>
</li>
<li><p><code>outputs/router_balance_metrics.csv</code></p>
</li>
</ul>
<h3>Result 1: FLOP-Matched Dense vs MoE</h3>
<p>From the final checkpoints in <code>isoflop_race_metrics.csv</code>:</p>
<pre><code class="language-text">Dense final: flops=207,920,037,888 | val_ppl=25.307996
MoE final  : flops=204,918,681,600 | val_ppl=20.983645
Delta      : 4.324351 perplexity (17.09%) in favor of MoE
</code></pre>
<p>The MoE endpoint is at ~1.44% lower cumulative FLOPs than dense in this smoke run. So this is not a mathematically perfect iso-point, but it is close enough for a directionally valid quality-per-compute comparison.</p>
<p>But this run also exposes a confounder that has to be disclosed:</p>
<pre><code class="language-text">Dense endpoint: step=12 | tokens_seen=18,432 | wall_seconds=1.719
MoE endpoint  : step=25 | tokens_seen=38,400 | wall_seconds=55.906
</code></pre>
<p>At equal FLOP budget, MoE consumed 108.33% more optimizer steps (and token batches) because each MoE step is cheaper. Wall-clock was 32.52x slower in this implementation. So this result is compute-normalized, but not update-count-normalized or wall-time-normalized.</p>
<p>The checkpoint trajectory matters too: MoE starts slightly behind at the lowest budget checkpoints, then overtakes and stays ahead through the endpoint.</p>
<img alt="Validation perplexity over cumulative FLOPs for dense vs MoE" style="display:block;margin:0 auto" />

<h3>Result 2: Router Balance With and Without Aux Loss</h3>
<p>From <code>router_balance_metrics.csv</code> at step 300:</p>
<pre><code class="language-text">no_aux   : entropy=2.027621 | top1_share=0.182227 | val_ppl=10.259389
with_aux : entropy=2.059556 | top1_share=0.175938 | val_ppl=10.315585
</code></pre>
<p>Derived deltas:</p>
<ul>
<li><p>Top-1 concentration drops by 0.006289 absolute, a 3.45% relative reduction.</p>
</li>
<li><p>Entropy increases by 0.031935.</p>
</li>
<li><p>Validation perplexity is slightly worse with aux in this run (difference 0.056196).</p>
</li>
</ul>
<img alt="Router entropy over training steps with and without auxiliary load balancing" style="display:block;margin:0 auto" />

<p>At this scale, both runs remain near-uniform (entropy close to <strong>ln(8)</strong>), so this is not strong expert specialization yet. Aux still improves balance modestly (lower top1_share and higher entropy), but that regularization did not translate into a perplexity win in this specific run.</p>
<p>So the practical win from aux here is routing smoothness, not immediate quality lift.</p>
<p>I want these caveats explicit:</p>
<ul>
<li><p>This is a smoke-scale run, not full-scale pretraining.</p>
</li>
<li><p>Tokenizer is character-level and corpus is small compared to production pretraining mixes.</p>
</li>
<li><p>Single-seed evidence. Robustness claims need multi-seed and larger compute budgets.</p>
</li>
</ul>
<p>Even with those caveats, the mechanism-level behavior is clear and reproducible.</p>
<p>I like to include one small raw-style checkpoint block because it forces honesty. It is easy to tell a clean story with only endpoints. It is harder when intermediate points are visible:</p>
<pre><code class="language-text">[step 250] no_aux:   val_ppl=10.4865 | top1_share=0.1769 | entropy=2.0358
[step 275] no_aux:   val_ppl=10.2653 | top1_share=0.1785 | entropy=2.0278
[step 300] no_aux:   val_ppl=10.2594 | top1_share=0.1822 | entropy=2.0276

[step 250] with_aux: val_ppl=10.4874 | top1_share=0.1765 | entropy=2.0637
[step 275] with_aux: val_ppl=10.2944 | top1_share=0.1773 | entropy=2.0609
[step 300] with_aux: val_ppl=10.3156 | top1_share=0.1759 | entropy=2.0596
</code></pre>
<p>What stands out is that with-aux keeps entropy higher, but both runs still sit close to uniform routing and with-aux does not improve validation perplexity in this smoke run. For production teams, this is a reminder to treat aux as a routing-control tool and validate quality impact separately.</p>
<h2>Production Reality</h2>
<p>As of March 2026, MoE is no longer a paper-only architecture. It is a deployment pattern. The stack has matured, but it has a specific shape that matters for engineering decisions.</p>
<h3>Where it runs</h3>
<p>From Mistral's release notes and serving ecosystem docs:</p>
<ul>
<li><p>Mixtral introduced 46.7B total and 12.9B active parameters per token using top-2 routing.</p>
</li>
<li><p>Mistral explicitly pushed open-source deployment through vLLM integration with MegaBlocks kernels.</p>
</li>
<li><p>vLLM now exposes OpenAI-compatible APIs and includes explicit expert-parallel and routed-expert deployment paths in docs and examples.</p>
</li>
</ul>
<p>So in practice, teams deploy MoE models behind the same API contracts as dense models, but with a very different runtime under the hood.</p>
<h3>What changed since the paper</h3>
<p>The highest-impact shift is kernel and runtime specialization.</p>
<p>In the paper view, MoE looks like a clean routing equation. In production, throughput depends on runtime handling of uneven token-to-expert assignments. This is why fused MoE kernels, dispatch optimizations, and expert-parallel communication strategies became first-class concerns in serving systems.</p>
<p>Second, deployment topology became part of model quality engineering.</p>
<p>Top-2 routing quality is not enough. Routing traffic also has to map cleanly to multi-GPU or multi-node topology. When prompt distributions create hot experts, latency tails widen even when average throughput still looks good.</p>
<p>Third, API compatibility got easier, model formatting got stricter.</p>
<p>OpenAI-compatible serving lowered integration friction. But real deployments still trip on tokenizer and chat-template mismatches, generation defaults, and request-shape differences across model families.</p>
<h3>The production gotcha</h3>
<p>The biggest MoE gotcha is memory versus active compute.</p>
<p>Active compute can look like a much smaller dense model, but total expert weights still have to be resident for efficient serving. Compute savings do not remove memory pressure. If capacity planning ignores that, deployments either under-provision VRAM or pay network penalties from aggressive weight sharding/offload.</p>
<p>My recommendation is to treat MoE rollout as two separate design problems:</p>
<ol>
<li><p>Quality and compute economics at model level.</p>
</li>
<li><p>Routing traffic engineering at runtime level.</p>
</li>
</ol>
<p>Teams that only solve the first one get surprised in production.</p>
<h3>What I instrument before calling an MoE deployment healthy</h3>
<p>For dense deployments, teams usually track latency, throughput, and error rate. For MoE, those are necessary but not sufficient. I add routing-aware telemetry to every serving stack:</p>
<ul>
<li><p>Per-expert token load over time windows.</p>
</li>
<li><p>Top-1 share and entropy by routeable layer.</p>
</li>
<li><p>P50/P95/P99 latency split by prompt length bucket.</p>
</li>
<li><p>Cross-device transfer volume for expert dispatch.</p>
</li>
</ul>
<p>This catches the most common silent failure: average latency looks fine, but one or two experts go hot and drive tail latency regressions. Under aggregate-only throughput monitoring, that failure mode can sit in production for weeks.</p>
<h3>Rollout pattern that actually works</h3>
<p>The safest MoE rollout I have seen is staged:</p>
<ol>
<li><p>Shadow traffic with full routing metrics enabled.</p>
</li>
<li><p>Limited canary where latency SLOs are evaluated on tail, not mean.</p>
</li>
<li><p>Full rollout only after expert load distribution is stable across daily traffic cycles.</p>
</li>
</ol>
<p>Dense-to-MoE migration fails when teams treat it like a drop-in model swap. It is a model swap plus a routing system rollout. If either side is weak, the launch degrades quickly.</p>
<h2>The Code</h2>
<p>Code and outputs are in this folder <a href="https://github.com/yash-61016/mixtral-of-experts"><code>mixtral-of-experts</code></a>:</p>
<ul>
<li><p><code>mixtral_isoflop_race.py</code>: FLOP-budget race between dense and MoE models.</p>
</li>
<li><p><code>mixtral_router_balance.py</code>: paired routing runs with and without auxiliary balancing.</p>
</li>
<li><p><code>moe_core.py</code>: shared model blocks, router loss, FLOP estimators, and diagnostics.</p>
</li>
</ul>
<p>Run with an RTX 3090 (24GB VRAM) for the intended experience. Smoke commands are in <code>README.md</code> and finish in minutes. Longer runs need a larger FLOP budget and step count to tighten confidence on the quality deltas.</p>
]]></content:encoded></item><item><title><![CDATA[LLaMA 2: How Three Borrowed Techniques Fit a 70B Model on Two GPUs]]></title><description><![CDATA[The Memory Problem
Serving 10 concurrent users with a 70B-scale model at 4K context, using the vanilla transformer architecture from 2017, requires roughly 240GB of GPU memory: about 140GB for weights]]></description><link>https://blogs.yashpatel.xyz/llama-2-how-three-borrowed-techniques-fit-a-70b-model-on-two-gpus</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/llama-2-how-three-borrowed-techniques-fit-a-70b-model-on-two-gpus</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 15 Mar 2026 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/f0d70439-cf5f-4a68-9251-1c21e30eb82d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Memory Problem</h2>
<p>Serving 10 concurrent users with a 70B-scale model at 4K context, using the vanilla transformer architecture from 2017, requires roughly 240GB of GPU memory: about 140GB for weights and about 100GB for the KV cache (the K and V tensors stored per token to avoid recomputing attention from scratch on each decode step). Two A100-80GBs give you 160GB. The math doesn't work.</p>
<p>Three techniques separate every modern open-weight model from the vanilla 2017 transformer: RoPE, RMSNorm, and GQA. None of them originated in LLaMA 2. RoPE came from Su et al. (2021). RMSNorm came from Zhang and Sennrich (2019). GQA came from Ainslie et al. (2023). What the LLaMA lineage did was consolidate them into a single coherent architecture that the entire open-source ecosystem built around. LLaMA 1 brought RoPE and RMSNorm into wide use. LLaMA 2 added GQA for its larger models (specifically the 34B and 70B) and doubled the context window to 4K. The 7B and 13B still use standard multi-head attention.</p>
<p>That is why every open-weight model released since mid-2023 shares the same config.json backbone: <code>num_key_value_heads</code>, <code>rope_theta</code>, <code>rms_norm_eps</code>. Not because LLaMA 2 invented them. Because LLaMA 2 was the first widely-adopted open model to ship all three together at scale.</p>
<p><a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">Post 1</a> covered the attention mechanism: QKV projections, scaled dot product, causal mask. This post covers what those three techniques actually do and why the combination works. Together they explain why <code>num_key_value_heads</code> differs from <code>num_attention_heads</code> in most modern decoder configs, what raising <code>rope_theta</code> from 10,000 to 500,000 actually does to context length, and why the open-weight ecosystem almost universally abandoned <code>LayerNorm</code> after 2022.</p>
<hr />
<h2>The Paper's Claim</h2>
<p>By mid-2023, the largest open-weight model was LLaMA 1 65B: competitive on several benchmarks, trained on 1.4T tokens, capped at 2K context, and requiring eight A100s for practical serving. Touvron et al. argued that three existing techniques, none original to Meta, could close most of the efficiency and quality gap: RoPE for position encoding that generalises to longer sequences, RMSNorm for stable training at scale, and GQA to cut the KV cache footprint by up to 8x at inference. LLaMA 2 70B, trained on 2T tokens with these three architectural changes, scores 68.9 on MMLU 5-shot, up 5.5 points from LLaMA 1 65B's 63.4 and within 1.1 points of GPT-3.5 at 70.0. GQA specifically brings the 70B's KV cache from around 100GB to around 12.5GB at a 10-user load, the number that makes two-GPU deployment realistic at that scale.</p>
<hr />
<h2>RoPE, RMSNorm, and GQA</h2>
<p>A LLaMA 2 70B forward pass is the same transformer Vaswani et al. described in 2017 at the system level. Token embeddings enter, 80 identical decoder blocks process them via residual streams, and a final linear projection produces logits over the vocabulary. The causal mask, the QKV attention sub-layer, the feed-forward sub-layer, the skip connections: all unchanged. Three specific components inside each decoder block differ from the 2017 original: how position is injected into queries and keys before the dot product (RoPE, replacing sinusoidal encodings), how each sub-layer's inputs are stabilised before the linear projections (RMSNorm, replacing LayerNorm), and how key-value tensors are distributed across the 64 attention heads (GQA with 8 KV heads, replacing full MHA). Those three changes are the entire architectural delta from the vanilla transformer.</p>
<p>Three separate failure modes, three separate fixes. RoPE (Su et al. 2021) fixes positional encoding's length generalisation problem. RMSNorm (Zhang and Sennrich 2019) removes a GPU synchronisation barrier from every forward pass. GQA (Ainslie et al. 2023) cuts the KV cache footprint by up to 8x without touching the attention math.</p>
<p>Take the token "model" at position 42 in a 4,096-token input, processed by the first decoder block of LLaMA 2 70B (d_model=8192, 64 query heads, 8 KV heads, head dimension d_k=128). That single token passes through all three modified components before any output is produced. Here is what each one does.</p>
<h3>RoPE: Position as Rotation</h3>
<p>The original transformer encodes position by adding a sinusoidal vector to each token's embedding before the first attention layer. Expand the dot product between query q at position m and key k at position n:</p>
<p>$$(\mathbf{q} + \text{PE}[m]) \cdot (\mathbf{k} + \text{PE}[n]) = \mathbf{q} \cdot \mathbf{k} + \mathbf{q} \cdot \text{PE}[n] + \text{PE}[m] \cdot \mathbf{k} + \text{PE}[m] \cdot \text{PE}[n]$$</p>
<p>The cross terms <code>q·PE[n]</code> and <code>PE[m]·k</code> depend on absolute positions m and n individually. There is no factorization that reduces this to a function of relative offset (m - n) alone. A model trained at 2K tokens has no algebraic guarantee that the positional signal at positions 500 and 510 transfers to positions 5,500 and 5,510. The model learns an approximation from data. It fails cleanly beyond training length.</p>
<p>Rotary Position Embedding (RoPE), introduced by Su et al. in "RoFormer" (2021) and adopted by LLaMA 1, applies rotations to Q and K inside the attention operation, after the linear projections but before the dot product. Each consecutive dimension pair (x[2k], x[2k+1]) is treated as a 2D point and rotated by an angle proportional to position. The rotation angles <code>θₖ = 10000^(−2k/d)</code> are fixed, not learned. For a token at position m, dimension pair k receives rotation m·θₖ:</p>
<p>$$x_{\text{rot}}[2k] = x[2k]\cos(m\theta_k) - x[2k+1]\sin(m\theta_k)$$</p>
<p>$$x_{\text{rot}}[2k+1] = x[2k]\sin(m\theta_k) + x[2k+1]\cos(m\theta_k)$$</p>
<p>Rotation is isometric: the norm of the vector is preserved. Additive positional encodings inflate vector norms by the PE magnitude, shifting softmax distributions in proportion to position rather than content. Rotation keeps attention logit scale unchanged.</p>
<p>What's interesting is what happens to the dot product after applying this rotation to both Q and K: the result depends only on content and relative distance (m - n). The algebraic proof is clean. The consequence is meaningful: a model trained at 4K context sees the same rotational relationship between tokens 100 and 200 as between tokens 4100 and 4200. This is why context extension works by scaling the base frequency schedule rather than full retraining.</p>
<p>Concrete numbers using LLaMA 2's head dimension (dₖ = 128, so 64 dimension pairs). The token "model" at position 42 receives rotation 42 × θ₀ = 42.0 radians on its fastest dimension pair (k = 0, completing about 6.7 full cycles) and 42 × θ₆₃ ≈ 0.0049 radians on its slowest (k = 63, barely a nudge). The 8,600x frequency spread across 64 pairs is the mechanism: fast-rotating pairs distinguish nearby tokens within a few positions; slow-rotating pairs retain signal across hundreds of tokens. Both happen simultaneously, in different dimensions of the same 128-dim head.</p>
<p>One non-obvious consequence: position 0 always receives identity rotation (m=0 makes all angles zero). The first token in every sequence, typically a BOS token, carries zero positional information from RoPE. Its attention patterns are driven entirely by content geometry. This explains why layer-0 attention heads often appear to attend uniformly across the sequence: there is no positional bias at the sequence start, only content similarity.</p>
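<p>A minimal NumPy sketch of the rotation above, plus a direct check of the relative-distance property (the positions and seed are arbitrary):</p>
<pre><code class="language-python">import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x (shape (d,)) by pos * theta_k."""
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # fixed theta_k per pair
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(128), rng.standard_normal(128)
# Dot product depends only on the offset: (100, 200) vs (4100, 4200).
a = rope(q, 100) @ rope(k, 200)
b = rope(q, 4100) @ rope(k, 4200)
print(abs(a - b))   # tiny: floating-point rounding only
</code></pre>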
<img alt="RoPE rotation angles heatmap" style="display:block;margin:0 auto" />

<p><em>Rotation angles freqs[position, k] mod 2π for 64 positions and 64 dimension pairs. Low-index pairs (bottom) complete multiple rotations within a short span. High-index pairs (top) are nearly flat. This frequency spread is what makes RoPE's locality bias multi-scale.</em></p>
<h3>RMSNorm: Removing the Synchronisation Barrier</h3>
<p>LayerNorm normalizes each activation vector in two sequential passes: subtract the mean (re-centering), then divide by the standard deviation (re-scaling). On GPU, the mean computation requires a barrier synchronisation across the full activation tensor before the variance step can begin. At large hidden dimensions and batch sizes, this barrier is a measurable throughput cost.</p>
<p>Root Mean Square Normalization (RMSNorm) drops re-centering entirely. Normalization is magnitude-only:</p>
<p>$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} \cdot \gamma, \qquad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{n}\sum_i x_i^2 + \varepsilon}$$</p>
<p>One pass. One learned scale γ per dimension. No mean subtraction, no learned shift parameter β. For the token "model" at position 42, this means the 8,192-dimensional activation vector entering each sub-layer is divided by its own RMS magnitude before the Q/K/V projections run. One scalar division, no synchronisation point. Zhang and Sennrich (2019) report 7-64% speedup over LayerNorm depending on architecture; for transformer blocks the gain is in the 7-9% range. At 80 layers across millions of training tokens, that compounds.</p>
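<p>The whole operation is a few lines of NumPy (a single-vector sketch; real implementations normalise over the last axis of a batched activation tensor):</p>
<pre><code class="language-python">import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Magnitude-only normalisation: no mean subtraction, no learned shift."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x / rms) * gamma

x = np.random.default_rng(1).standard_normal(8192)   # one 8,192-dim activation
y = rms_norm(x, gamma=np.ones(8192))
print(np.sqrt(np.mean(y * y)))   # ~1.0: output has unit RMS magnitude
</code></pre>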
<p>LLaMA 1 adopted RMSNorm as pre-norm: before each sub-layer, not after. The original transformer used post-norm. At 70B+ scale, post-norm produces unnormalized residuals that accumulate through the stack and reach the loss before any stabilization. Training diverges early. GPT-3 adopted pre-norm and documented the result. Every large model trained since followed.</p>
<h3>GQA: Asymmetric Caching</h3>
<p>Every autoregressive decode step computes Q, K, and V for the new token, appends K and V to the cache, then reads the entire cached K and V to compute attention over the full context. The query is used once and discarded. Keys and values persist for the full lifetime of the sequence.</p>
<p>Grouped Query Attention (GQA), introduced by Ainslie et al. (2023), is built around this asymmetry. Query heads are never cached, so the number of query heads has zero effect on KV cache size. Only KV heads consume persistent memory. LLaMA 2 adopted GQA for its 34B and 70B models; the 7B and 13B use standard MHA. The 70B uses 64 query heads and 8 KV heads: every group of 8 query heads shares the same K and V projections. The token "model" at position 42 has its K and V projections written to one of those 8 KV slots. All 8 query heads in the same group attend over that single K/V entry. The attention computation inside each group is unchanged. Quality loss is negligible: Ainslie et al. report ROUGE-1 degradation within 0.3 points on any individual dataset and 0.1 points on average at T5 XXL scale.</p>
<p>The formula makes the savings concrete:</p>
<p>$$\text{KV bytes} = 2 \times \text{batch} \times \text{layers} \times \text{seq_len} \times \text{num_kv_heads} \times \text{head_dim} \times \text{dtype_bytes}$$</p>
<p>At seq=4096, batch=10, float16:</p>
<pre><code class="language-plaintext">Full MHA (64 KV heads):  2 × 10 × 80 × 4096 × 64 × 128 × 2  ≈  100 GB
GQA     ( 8 KV heads):  2 × 10 × 80 × 4096 ×  8 × 128 × 2  ≈  12.5 GB
</code></pre>
<p>The reason this matters at inference is memory bandwidth, not just total capacity. Each decode step reads the entire KV cache from DRAM. An A100 delivers 312 TFLOP/s of compute but only 2 TB/s of memory bandwidth. The compute units wait on the memory bus. GQA's 8x reduction in KV heads translates directly to 8x fewer bytes transferred per decode step. Latency at decode time tracks this reduction closely, since memory bandwidth is the binding constraint at the 70B scale.</p>
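<p>The formula as a function, reproducing both numbers (GiB here, matching the arithmetic above):</p>
<pre><code class="language-python">def kv_cache_bytes(batch, layers, seq_len, num_kv_heads, head_dim, dtype_bytes=2):
    """2x for K and V; every other factor multiplies linearly."""
    return 2 * batch * layers * seq_len * num_kv_heads * head_dim * dtype_bytes

mha = kv_cache_bytes(10, 80, 4096, 64, 128)   # LLaMA 2 70B shape with full MHA
gqa = kv_cache_bytes(10, 80, 4096, 8, 128)    # the shipped 8-KV-head config
print(mha / 2**30, gqa / 2**30)               # 100.0 GiB vs 12.5 GiB
print(mha / gqa)                              # 8.0
</code></pre>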
<p>The connection back to <a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">post 1</a>: Flash Attention solves the O(n²) score matrix problem by rewriting the attention kernel. GQA solves a different problem. Even with Flash Attention eliminating the materialized score matrix, the KV cache still grows linearly with sequence length and linearly with num_kv_heads. Flash Attention does not touch the cache. GQA removes the second multiplier.</p>
<hr />
<h2>What I Built and What I Found</h2>
<p><strong>Artifact 1</strong> implements RoPE in 200 lines of pure NumPy, with no PyTorch or HuggingFace dependency. Hardware: CPU-only, runtime around 45 seconds. The verification strategy uses the complex-number representation: <code>z_rot = z * exp(i * theta)</code> is mathematically equivalent to the real-valued 2D rotation matrix and serves as ground truth.</p>
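<p>The relative-distance property the artifact verifies fits in a few lines. A sketch (not the artifact itself) of the same complex-number formulation: rotating q by position m and k by position n, their dot product depends only on m - n:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head_dim, must be even
theta = 10000.0 ** (-np.arange(d // 2) * 2 / d)

def rope(x, pos):
    # rotate consecutive (even, odd) pairs of x by pos * theta via complex multiply
    z = x[0::2] + 1j * x[1::2]
    z = z * np.exp(1j * pos * theta)
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

q, k = rng.standard_normal(d), rng.standard_normal(d)
a = rope(q, 5) @ rope(k, 3)        # positions (5, 3), relative distance 2
b = rope(q, 105) @ rope(k, 103)    # positions (105, 103), relative distance 2
assert abs(a - b) < 1e-9           # same distance, same score, up to rounding
```

Shifting both positions by 100 leaves the score unchanged, which is exactly the property the artifact proves over 20 random pairs.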
<pre><code class="language-plaintext">[Verification vs. complex reference]
  Max |q_complex - q_ours| : 0.00e+00
  Max |k_complex - k_ours| : 0.00e+00
  Expected: &lt; 1e-14  (exact same computation, different notation)
</code></pre>
<p>Zero error. That is not an approximation matching a reference; it is two representations of the same geometric operation. The numerical proof of the relative-distance property over 20 random token pairs at arbitrary positions gives max error less than 1e-12. Floating-point rounding. The property is exact.</p>
<p>What the paper doesn't mention: the attention decay curve for sinusoidal PE is not monotonic. I measured E[|q · k|] as a function of relative distance d, averaging over 200 random unit-norm 128-dimensional vector pairs. RoPE decays from ~0.07 at d=0 to ~0.01 at d=127. Clean. Sinusoidal encoding behaves differently: pe[0] = [0,1,0,1,...] has norm \(\sqrt{64} \approx 8\), which completely dominates unit-norm content vectors. The sinusoidal signal at d=0 sits near 1.0, overwhelmingly driven by position rather than content, then oscillates with no consistent directional trend across the 128-token window.</p>
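<p>The pe[0] norm claim is easy to check directly. A quick sketch, assuming the standard interleaved sin/cos layout at d=128:</p>

```python
import numpy as np

d = 128
i = np.arange(d // 2)
freqs = 1.0 / (10000 ** (2 * i / d))
pe0 = np.empty(d)
pe0[0::2] = np.sin(0 * freqs)  # all zeros at position 0
pe0[1::2] = np.cos(0 * freqs)  # all ones at position 0
norm = np.linalg.norm(pe0)
print(norm)  # 8.0 == sqrt(64), dwarfing unit-norm content vectors
```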
<img alt="RoPE vs sinusoidal attention decay" style="display:block;margin:0 auto" />

<p><em>E[|q · k|] vs relative distance. RoPE (blue) decays monotonically from ~0.07. Sinusoidal (orange) starts near 1.0 because pe[0] dominates unit-norm content vectors, then shows no consistent directional decay. Log scale on the right confirms RoPE's clean monotonic tail.</em></p>
<p>A model trained with sinusoidal PE must learn locality bias entirely from data. RoPE provides it geometrically. That distinction matters most when extrapolating to sequence lengths not seen during training.</p>
<p><strong>Artifact 2</strong> allocates real PyTorch KV cache tensors on an RTX 3090 and measures <code>torch.cuda.memory_allocated()</code> directly. No model weights loaded. The benchmark sweeps four architectures across sequence lengths 128-4096 and computes maximum batch size under a 20GB VRAM budget.</p>
<p>Results at seq=4096, batch=1:</p>
<pre><code class="language-plaintext">  GPT-2 XL  (MHA, 25 KV heads, 48 layers)  : 1.172 GiB
  LLaMA-2 7B (MHA, 32 KV heads, 32 layers)  : 2.000 GiB
  LLaMA-2 70B (GQA, 8 KV heads, 80 layers)  : 1.250 GiB
  MQA variant  (1 KV head,  80 layers)       : 0.156 GiB

  70B hypothetical MHA (64 KV heads):        10.000 GiB
  Actual 70B GQA  (8 KV heads):               1.250 GiB  (8.0x reduction)
</code></pre>
<p>Maximum batch under 20GB budget at seq=4096:</p>
<pre><code class="language-plaintext">  GPT-2 XL     (MHA):  max_batch = 17
  LLaMA-2 7B   (MHA):  max_batch = 10
  LLaMA-2 70B  (GQA):  max_batch = 16
  70B hypothetical MHA: max_batch =  2
</code></pre>
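<p>The batch numbers fall straight out of the per-sequence cache cost. A sketch reproducing the table analytically, using the head and layer counts listed above:</p>

```python
def kv_bytes_per_seq(layers, kv_heads, head_dim, seq_len=4096, dtype_bytes=2):
    # KV cache cost of one sequence; 2x for the K and V tensors
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

budget = 20 * 2**30  # 20 GiB KV cache budget
configs = {
    "GPT-2 XL (MHA)":       (48, 25, 64),
    "LLaMA-2 7B (MHA)":     (32, 32, 128),
    "LLaMA-2 70B (GQA)":    (80, 8, 128),
    "70B hypothetical MHA": (80, 64, 128),
}
max_batch = {name: budget // kv_bytes_per_seq(*cfg) for name, cfg in configs.items()}
print(max_batch)
```

The analytical numbers match the measured ones because, at seq=4096, allocator rounding is negligible.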
<p>The 70B GQA model serves 16 concurrent sequences within 20GB. Without GQA, the same model serves 2. The 7B MHA model serves 10 sequences, fewer than the 70B GQA, because LLaMA 2 7B's 32 KV heads at 32 layers produce a larger per-token KV footprint than the 70B's 8 KV heads at 80 layers. Counter-intuitive: model parameter count is not the right proxy for KV memory footprint at inference. Head count and layer count are.</p>
<img alt="KV cache memory vs sequence length" style="display:block;margin:0 auto" />

<p><em>KV cache GiB at batch=1 across four architectures. The 70B GQA line (1.250 GiB at seq=4096) stays below the 7B MHA line (2.000 GiB) even though the model is 10x larger: 8 KV heads × 80 layers beats 32 KV heads × 32 layers.</em></p>
<img alt="Max batch under 20GB budget" style="display:block;margin:0 auto" />

<p><em>Maximum concurrent sequences under a 20GB KV cache budget at seq=4096. MQA reaches 120+ at short contexts. The 70B GQA (16 sequences) outperforms the 7B MHA (10 sequences), and gives 8x more capacity than hypothetical 70B MHA (2 sequences).</em></p>
<p>I found that measured memory sits slightly above the analytical formula for small tensors. At seq=128 for the GPT-2 XL configuration, the empirical-to-theoretical ratio is approximately 1.010. The CUDA allocator rounds allocations to internal block boundaries. At seq=4096 the ratio rounds to 1.000. The GQA paper reports memory savings using the analytical formula only. At production sequence lengths the formula is exact; at short contexts it overstates savings by 1-3%.</p>
<hr />
<h2>Two Years Later</h2>
<p><strong>Where it runs:</strong> LLaMA 3 (8B, 70B, 405B), Mistral 7B and its derivatives, Gemma 2 (9B, 27B), Qwen 2.5, Phi-3 Medium. Every <code>modeling_llama.py</code> in HuggingFace implements all three mechanisms. vLLM, TGI, and TensorRT-LLM build their attention kernels around the GQA layout (from Ainslie et al. 2023) and RoPE rotations (from Su et al. 2021).</p>
<p><strong>What's changed:</strong></p>
<p>Context length is the dimension that has moved furthest. LLaMA 1 targeted 2K, LLaMA 2 doubled to 4K. LLaMA 3.1 extended to 128K via additional long-context training with one specific lever: <code>rope_theta</code>. LLaMA 2 uses <code>rope_theta = 10,000</code>, the default from the original RoPE paper (Su et al. 2021). LLaMA 3 bumped it to 500,000. Raising <code>rope_theta</code> slows every dimension pair's rotation, so the slow-rotating pairs retain meaningful discrimination capacity at much greater token distances. The 50x increase in <code>rope_theta</code> roughly tracks the 32x jump in effective context from 4K to 128K. YaRN (2023) and the LLaMA 3 technical report both identify <code>rope_theta</code> scaling as the primary lever for context extension without retraining from scratch.</p>
<p>GQA is effectively the default for open-weight decoder models released since 2023: Mistral 7B, LLaMA 3 (all sizes), Qwen 2.5, and Phi-3 all ship it. The earlier debate between GQA and MQA (single shared KV head) settled in favor of GQA: MQA is measurably worse above 34B parameters, where the single key-value representation bottlenecks model capacity.</p>
<p>In decoder-only models at 7B+ scale, RMSNorm has effectively replaced LayerNorm across every major open-weight release since 2023.</p>
<p><strong>The production gotcha:</strong> Model weights are a fixed one-time memory cost. The KV cache is not. It grows with every token generated, every concurrent user, every active session. At 100 concurrent users each holding a conversation at 8K context, a 70B GQA model needs:</p>
<pre><code class="language-plaintext">2 × 100 × 80 × 8192 × 8 × 128 × 2  ≈  268 GB of KV cache
</code></pre>
<p>That is roughly 1.9x the model weights. The model fits on two A100s. The conversations do not. This is exactly why vLLM implements PagedAttention and why KV cache eviction exists in every production serving stack. GQA makes the per-token cost 8x smaller. At high concurrency and long context, the cache still dominates. LLaMA 2 identifies the mechanism; the serving systems covered later in this series are built around managing it.</p>
<hr />
<h2>The Code</h2>
<p>Both artifacts are in <a href="https://github.com/yash-61016/llama-2"><code>llama-2/</code></a>.</p>
<p><strong>Artifact 1</strong> (<code>rope_from_scratch.py</code>): RoPE in pure NumPy, no PyTorch. Proves the relative-distance property over 20 random position pairs, verifies against a complex-number reference (max error 0.00e+00), and generates the decay curve comparison against sinusoidal PE and the rotation angle heatmap. Hardware: CPU-only. Runtime: ~45 seconds.</p>
<p><strong>Artifact 2</strong> (<code>gqa_kvcache_benchmark.py</code>): Allocates real KV cache tensors on GPU for four architecture configurations, measures memory with <code>torch.cuda.memory_allocated()</code>, sweeps seq 128-4096, computes max batch under a 20GB budget, and generates the memory and batch capacity charts. Hardware: RTX 3090 (24GB VRAM), CUDA required. Runtime: ~25 seconds.</p>
<p>Dependencies: PyTorch 2.1.0, NumPy 1.24.3, Matplotlib 3.7.1.</p>
]]></content:encoded></item><item><title><![CDATA[Attention Is All You Need: What the Paper's Heads Are Actually Doing at Each Layer]]></title><description><![CDATA[Every production LLM you interact with today, LLaMA 3, Mistral, Gemma, Claude, runs on multi-head attention as its core computation. The paper that introduced it, "Attention Is All You Need" (Vaswani ]]></description><link>https://blogs.yashpatel.xyz/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[transformers]]></category><category><![CDATA[llm]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 08 Mar 2026 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/270843ed-ac11-4d1d-826f-48bf01ee2c4d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every production LLM you interact with today, LLaMA 3, Mistral, Gemma, Claude, runs on multi-head attention as its core computation. The paper that introduced it, "Attention Is All You Need" (Vaswani et al., 2017), proved the mechanism works and observed in passing that different heads appear to learn different things. That observation is in the paper. The measurement is not.</p>
<p>The paper never quantifies what happens to head specificity as you go deeper into the network. No entropy measurement, no gradient from early to late layers, no empirical signal about how the network organises itself internally across depth. It shows a few hand-selected attention visualizations and moves on. Nine years later, every fine-tuning guide tells you to "freeze the early layers" without explaining what those layers are actually doing differently from the late ones.</p>
<p>That gap is what this post fills. I measured Shannon entropy per head across all 12 layers and 12 heads of GPT-2 small on 100 varied English sentences. The result is a clean monotonic gradient: early-layer heads attend broadly (mean entropy 1.42 nats), late-layer heads lock onto specific tokens (mean entropy 0.50 nats). The two populations are nearly 3x apart. This is not a subtle effect.</p>
<p>If you are deciding how many layers to freeze during LoRA fine-tuning, or debugging why a model attends to the wrong tokens at inference, understanding this gradient is the starting point. The paper gives you the mechanism. This post gives you the empirical structure inside it.</p>
<hr />
<h2>How Attention Actually Works</h2>
<p>Scaled dot-product attention replaces the sequential state update of an RNN with a direct query over every token in parallel. The core idea in two sentences: for each token, compute a compatibility score against every other token via a dot product, normalize those scores with softmax, and blend the corresponding value vectors by those weights. This produces a context-aware representation of every token in a single matrix multiply: no sequential dependency, no stored hidden state.</p>
<p>The paper formalizes this as:</p>
<p><strong>Attention(Q, K, V) = softmax(QK^T / √d_k) · V</strong></p>
<p><strong>Q, K, V are not the same matrix.</strong> Every token is projected three times with separate learned weights. W_Q produces the vector that gets dot-producted against other tokens' keys to produce compatibility scores. W_K produces the vector that gets matched against other tokens' queries. W_V produces the vector whose weighted blend becomes the output. The same token "bank" produces three different vectors: a Q that scores high against financial-context keys, a K that scores high against tokens querying for nouns, and a V carrying its semantic content. Separating these three roles is what lets different heads specialise: one head's Q-K scoring geometry can learn syntactic adjacency while another's learns semantic relatedness.</p>
<p><strong>The √d_k scaling is not cosmetic.</strong> With d_k=64, a dot product between two random 64-dimensional vectors has expected magnitude ~8, because variance grows linearly with dimension. Without scaling, large values push softmax into its saturated regime: all probability mass collapses onto one token and gradients vanish. Dividing by √64=8 keeps input variance at ~1.</p>
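<p>The variance claim can be sanity-checked empirically. A sketch sampling random Gaussian vectors at d_k=64:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
a = rng.standard_normal((100_000, d_k))
b = rng.standard_normal((100_000, d_k))
dots = np.einsum("ij,ij->i", a, b)            # 100k raw dot products
print(round(dots.std(), 2))                   # ~8.0 == sqrt(d_k)
print(round((dots / np.sqrt(d_k)).std(), 2))  # ~1.0 after the paper's scaling
```

Without the division, softmax inputs with std 8 push most rows into near-one-hot territory before any training has happened.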
<p><strong>Multi-head attention</strong> runs 8 independent attention passes over d_k=64-dimensional subspaces, then concatenates and projects the results. The motivation: a single softmax over 512 dimensions tends to collapse into near-one-hot distributions, losing the ability to track multiple relationships simultaneously. Eight heads over 64 dimensions each costs roughly the same as one head over 512 dimensions. The key insight is that you split before computing attention, not after.</p>
<p>The implementation detail that trips people: the heads are not run in a loop. The input starts as <code>(batch, seq, d_model)</code>. A view and transpose converts it to <code>(batch, heads, seq, d_k)</code>, placing the heads dimension where PyTorch's matmul can treat them as independent batch dimensions. With shape <code>(1, 8, 10, 64)</code>, a single <code>torch.matmul(Q, K.transpose(-2, -1))</code> produces all 8 score matrices simultaneously. The compute graph is identical to 8 separate matrix multiplies, but the batched form maps to a single CUBLAS kernel call.</p>
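<p>The same shape trick can be sketched in NumPy (the artifact uses PyTorch, but matmul broadcasting over leading batch dimensions works identically):</p>

```python
import numpy as np

batch, seq, heads, d_k = 1, 10, 8, 64
x = np.random.randn(batch, seq, heads * d_k)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, d_k): split BEFORE attention
    return t.reshape(batch, seq, heads, d_k).transpose(0, 2, 1, 3)

Q, K = split_heads(x), split_heads(x)
scores = Q @ K.transpose(0, 1, 3, 2)  # one batched matmul -> all 8 score matrices
print(scores.shape)  # (1, 8, 10, 10)
```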
<p>The causal mask for decoder self-attention enforces that token i cannot attend to position j &gt; i. The upper triangle of the score matrix is set to negative infinity before softmax fires. Since exp(-inf) = 0 exactly, future tokens contribute zero weight to the output, and the row sums remain exactly 1 without any additional normalisation step:</p>
<pre><code class="language-plaintext">causal mask, 5 tokens:
token 0: [  s00, -inf, -inf, -inf, -inf ]  ← only sees itself
token 1: [  s10,  s11, -inf, -inf, -inf ]
token 2: [  s20,  s21,  s22, -inf, -inf ]
</code></pre>
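<p>A minimal single-head sketch of the masked attention above (NumPy, illustration only, not the artifact code):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                            # compatibility scores
    scores[np.triu(np.ones((n, n), dtype=bool), 1)] = -np.inf  # mask the future
    weights = softmax(scores)                                  # exp(-inf) = 0 exactly
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 64))
out, w = causal_attention(Q, K, V)
assert w[0, 0] == 1.0                    # token 0 sees only itself
assert np.allclose(np.triu(w, 1), 0.0)   # zero weight on future tokens
assert np.allclose(w.sum(axis=1), 1.0)   # rows still sum to 1, no renormalisation
```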
<p><strong>Positional encoding</strong> patches the mechanism's intrinsic blindness to order. Attention is a set operation: permuting the input tokens produces the same weighted sums, just permuted. Word order is invisible without an explicit signal. The paper injects position by adding sinusoidal vectors to the input embeddings before the first layer. Each pair of dimensions oscillates at a different frequency, with wavelengths forming a geometric progression from 2π positions for the fastest pair to 10,000·2π positions for the slowest. The model can infer absolute position from the joint oscillation pattern across all 512 dimensions. Practically, this was superseded by RoPE in modern LLMs, but the requirement remains: position information must be injected explicitly.</p>
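<p>A sketch of the encoding in the interleaved sin/cos layout (a hedged reconstruction, not the paper's reference code):</p>

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model=512):
    # frequencies form a geometric progression from 1 down to 1/10000 rad/position
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(4096)
print(pe.shape)  # (4096, 512)
# every position gets a distinct fingerprint across the 512 dimensions
assert np.all(np.linalg.norm(pe[1:] - pe[:-1], axis=1) > 0)
```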
<hr />
<h2>What I Found Running It</h2>
<p>I implemented scaled dot-product attention and multi-head attention in 200 lines of pure PyTorch, without using <code>torch.nn.MultiheadAttention</code>. Every intermediate tensor shape is annotated inline. Hardware: RTX 3090 (24GB VRAM). Library versions: torch==2.1.0, transformers==4.38.2.</p>
<p><strong>What matched:</strong> Running my implementation against <code>F.scaled_dot_product_attention</code> on identical inputs with a causal mask gives a max absolute difference of 1.19e-07. The implementations agree to floating-point precision.</p>
<pre><code class="language-plaintext">Max absolute difference vs PyTorch reference : 1.19e-07
PASS  implementation matches reference
</code></pre>
<p><strong>I found that:</strong> the VRAM cost of the score matrix hits numbers that make batching impossible far earlier than intuition suggests. The score matrix Q·K^T has shape <code>(batch × heads × n × n)</code> in float32, meaning 4 bytes × n² entries per layer. I ran forward passes at five sequence lengths on a single-layer toy model on the RTX 3090:</p>
<pre><code class="language-plaintext"> Seq length    Peak VRAM      Theoretical score matrix
──────────────────────────────────────────────────────
          64      22 MB            2.6 MB
         128      24 MB           10.5 MB
         256      33 MB           41.9 MB
         512      68 MB          167.8 MB
        1024     241 MB          671.1 MB
</code></pre>
<img alt="VRAM scaling chart" style="display:block;margin:0 auto" />

<p><em>Peak VRAM measured on RTX 3090 across five sequence lengths. The quadratic fit confirms O(n^2) growth: at n=1024, the score matrix alone reaches 671MB for a single-layer model.</em></p>
<p>At n=1024, the score matrix for a single-layer toy model reaches 671 MB. Scale that to GPT-3's 96 layers and you get the number that made Flash Attention (2022) a necessity, not an optimisation.</p>
<p><strong>What the paper doesn't measure:</strong> Shannon entropy falls monotonically with layer depth. I ran GPT-2 small (117M parameters, 12 layers × 12 heads) over 100 English sentences covering SVO, relative clauses, passive constructions, and coreference. Per head I measured Shannon entropy (how diffuse the attention distribution is) and diagonal score (fraction of attention within ±2 positions of the diagonal, as a proxy for purely positional heads). I classified all 144 heads into four empirical types: local (positional), copy (locks onto one or two tokens), broad (attends widely), and mixed.</p>
<pre><code class="language-plaintext">KEY FINDING: layer-depth entropy gradient
  Early layers (0-3)  mean entropy : 1.421 nats
  Late  layers (8-11) mean entropy : 0.497 nats
  Gradient            (late-early) : -0.924 nats
</code></pre>
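<p>The two per-head metrics are short to implement. A sketch against a row-stochastic attention matrix (the helper names are illustrative, not the artifact's):</p>

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    # Shannon entropy (nats) of each attention row; high = diffuse, low = focused
    w = np.clip(weights, eps, 1.0)
    return -(w * np.log(w)).sum(axis=-1)

def diagonal_score(weights, window=2):
    # fraction of attention mass within +/- window positions of the diagonal
    n = weights.shape[-1]
    near = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= window
    return float((weights * near).sum() / weights.sum())

broad = np.full((8, 8), 1 / 8)   # attends uniformly: entropy = ln(8) ~ 2.08 nats
copy = np.eye(8)                 # locks onto a single token: entropy ~ 0
print(attention_entropy(broad)[0], attention_entropy(copy)[0])
```

Note that the identity matrix scores both low entropy and a perfect diagonal score, which is why the classification must check locality before sharpness.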
<img alt="Head layer depth gradient" style="display:block;margin:0 auto" />

<p><em>Left: mean attention entropy per layer in GPT-2 small across 100 sentences. Right: head type distribution per layer. Early layers are dominated by local and broad heads. Late layers converge on copy and mixed heads.</em></p>
<img alt="Head entropy heatmap" style="display:block;margin:0 auto" />

<p><em>Mean Shannon entropy per head across 12 layers x 12 heads. Light cells are sharp and focused (low entropy). Dark cells are diffuse (high entropy).</em></p>
<p>Early layers are nearly 3x more diffuse than late layers. The paper shows hand-selected attention visualizations and notes that "different heads learn to perform different tasks." It never quantifies what happens to head specificity as depth increases. Layer 0 looks almost uniform; Layer 11 looks like a spiked distribution.</p>
<p>I found that the head classification requires checking diagonal score before entropy. A local head, one that only attends to the token immediately to its left, can have very low entropy because its distribution is also sharply peaked. Checking entropy first mislabels it as a copy head. The diagonal check catches it correctly as a positional head. This ordering matters if you are building any automated attention analysis tooling.</p>
<hr />
<h2>Where It Runs in 2026</h2>
<p><strong>Where it runs:</strong> Every transformer-family model in production. LLaMA 2/3, Mistral, Gemma, Falcon, GPT-4, Claude: all implement direct descendants of the mechanism in this paper. PyTorch's <code>nn.MultiheadAttention</code>, every Hugging Face model's attention module, and every CUDA kernel in vLLM, TGI, and TensorRT-LLM trace back to Section 3.2.</p>
<p><strong>What's changed since the paper:</strong></p>
<p>The Table 3 configuration (d_model=512, 8 heads, 6 layers, sinusoidal positional encoding) is a toy by current standards. The biggest change is positional encoding. Sinusoidal fixed encodings, as used in the original paper, were replaced by Rotary Position Embeddings (RoPE) in LLaMA and Mistral and by ALiBi in MPT. RoPE applies a rotation matrix to Q and K inside the attention operation rather than adding position to the input embedding. This gives better length generalisation and sharper relative distance signal, which is why it is the default choice for every model targeting 128K+ context windows.</p>
<p>The other structural change with comparable impact is the shift from encoder-decoder to decoder-only. GPT, LLaMA, and Mistral drop the encoder entirely. Cross-attention between encoder and decoder is replaced by in-context conditioning through the causal attention mask. The masked decoder self-attention from Section 3.2.3 is the only attention mechanism in every autoregressive LLM today. The encoder half of the original paper is now primarily relevant for embedding models like BERT and its descendants.</p>
<p>Pre-norm (LayerNorm before each sub-layer) replaced post-norm because post-norm causes gradient explosions at 70B+ scale. Grouped Query Attention (GQA), used in LLaMA 2/3 70B, shares 8 K/V heads across 64 query heads, cutting KV cache by 8x with under 1% accuracy loss.</p>
<p><strong>The production gotcha:</strong> KV cache memory at long contexts exceeds model weights. LLaMA-3-70B has approximately 140GB of weights in float16. With GQA, each token in the KV cache costs: 80 layers × 8 KV heads × 128 d_k × 2 (K and V) × 2 bytes = ~327KB per token. At a 128K context, that is ~43GB of KV cache. Without GQA (full multi-head, 64 KV heads), it would be ~343GB, more than double the model weights. Batching 10 concurrent users at 128K context without GQA would require ~3.4TB of KV cache. This is why GQA, Multi-Query Attention, and KV cache quantization exist in every production serving stack: the scaling law for the attention mechanism bites in production before it bites in benchmarks.</p>
<hr />
<h2>When to Use It, When Not To</h2>
<p><strong>USE WHEN:</strong></p>
<ul>
<li><p>Input length is &lt; 8K tokens and you need all-to-all relationships: standard fine-tuning on classification, summarisation, or translation tasks</p>
</li>
<li><p>Training on a multi-GPU cluster where the parallelization benefit of attention over RNNs is the primary constraint</p>
</li>
<li><p>The task requires resolving long-range dependencies (coreference, discourse coherence) that CNNs cannot reach in a single pass</p>
</li>
<li><p>You want interpretable attention patterns for debugging or analysis</p>
</li>
</ul>
<p><strong>DON'T USE WHEN:</strong></p>
<ul>
<li><p>Sequence length regularly exceeds 8K tokens in production without Flash Attention: the O(n²) score matrix makes large batches impossible at naive float32</p>
</li>
<li><p>You're targeting sub-100MB VRAM inference: at n=1024 the score matrix alone is 671MB per layer</p>
</li>
<li><p>You need O(1) memory per token for streaming inference: attention requires the full KV context (KV cache grows linearly with sequence length)</p>
</li>
</ul>
<p><strong>USE ALTERNATIVE INSTEAD:</strong></p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Alternative</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Context &gt; 32K tokens</td>
<td>Flash Attention 2/3</td>
<td>Rewrites attention kernel to avoid materializing the O(n²) score matrix; same mathematical output, O(n) peak VRAM</td>
</tr>
<tr>
<td>Many concurrent users at long context</td>
<td>Grouped Query Attention (GQA)</td>
<td>Shares K/V heads across query heads; reduces KV cache 4–8× with &lt;1% accuracy loss</td>
</tr>
<tr>
<td>Inference VRAM &lt; 4GB</td>
<td>State Space Models (Mamba)</td>
<td>O(n) memory via selective recurrence; no KV cache at all; competitive on many tasks</td>
</tr>
<tr>
<td>Document retrieval, not generation</td>
<td>Late interaction (ColBERT)</td>
<td>Per-token MaxSim scoring instead of full attention; retrieves better at lower compute</td>
</tr>
</tbody></table>
<p>The core trade-off is exact: attention provides a one-hop connection between any two tokens at O(n²) space. Every production modification either approximates that connection (GQA, local attention windows) or rewrites the computation graph to avoid materializing it (Flash Attention). The paper introduced the mechanism. Nine years of engineering work have been spent making it practical at scale.</p>
<hr />
<h2>The Code</h2>
<p>Both artifacts are in <a href="https://github.com/yash-61016/attention-is-all-you-need"><code>attention-is-all-you-need/</code></a>.</p>
<p><strong>Artifact 1</strong> (<code>attention_from_scratch.py</code>): Implements scaled dot-product attention and multi-head attention in 200 lines of pure PyTorch with explicit QKV projections, verified against <code>F.scaled_dot_product_attention</code> (max diff 1.19e-07). Includes a VRAM scaling experiment across five sequence lengths demonstrating quadratic growth. Runs in ~2 minutes on an RTX 3090 or CPU.</p>
<p><strong>Artifact 2</strong> (<code>attention_head_analysis.py</code>): Loads GPT-2 small (117M parameters) and runs 100 varied English sentences through all 12 layers × 12 heads. Measures per-head Shannon entropy and diagonal locality score, classifies all 144 heads into four empirical types, and produces three charts: the 12×12 entropy heatmap, the layer-depth gradient showing specialisation increasing with depth, and one representative attention matrix per head type.</p>
<p>Hardware: RTX 3090 (24GB VRAM). Dependencies: PyTorch 2.1.0, Transformers 4.38.2, Matplotlib 3.7.1.</p>
]]></content:encoded></item><item><title><![CDATA[Beyond try/except: Architecting Robust Error Handling in Python Applications]]></title><description><![CDATA[As Python developers gain experience, the simple try...except block, while essential, often reveals its limitations in larger, more complex applications. We move from merely catching errors to needing a coherent strategy for managing them – one that ...]]></description><link>https://blogs.yashpatel.xyz/beyond-tryexcept-architecting-robust-error-handling-in-python-applications</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/beyond-tryexcept-architecting-robust-error-handling-in-python-applications</guid><category><![CDATA[Testing]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:56:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745391218296/5cf168d5-852a-44df-9d65-99aa827af69d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As Python developers gain experience, the simple <code>try...except</code> block, while essential, often reveals its limitations in larger, more complex applications. We move from merely catching errors to needing a coherent strategy for managing them – one that enhances readability, simplifies debugging, and improves overall system resilience. Basic <code>try/except</code> blocks can quickly lead to tangled logic, obscured error origins, and difficulty in distinguishing between expected hiccups and genuine catastrophes.</p>
<p>This post delves into architectural patterns and advanced techniques for error handling in Python, targeting engineers looking to build more maintainable and resilient systems. We'll explore how thoughtful design, custom exceptions, alternative signalling patterns, strategic logging, and dedicated testing can transform error handling from a reactive chore into a proactive element of robust application architecture. Let's move beyond basic error catching and architect applications that don't just crash gracefully but handle failures intelligently.</p>
<h2 id="heading-introduction-the-limits-of-basic-exception-handling"><strong>Introduction: The Limits of Basic Exception Handling</strong></h2>
<p>The standard <code>try...except</code> mechanism is Python's cornerstone for handling runtime errors. It allows us to gracefully recover from unexpected situations. However, relying solely on generic <code>except Exception:</code> or scattering <code>try...except</code> blocks liberally throughout a codebase often leads to problems:</p>
<ol>
<li><p><strong>Loss of Specificity:</strong> Catching broad exceptions makes it hard to know <em>what</em> actually went wrong and react appropriately.</p>
</li>
<li><p><strong>Obscured Control Flow:</strong> Deeply nested <code>try...except</code> blocks can make code difficult to follow and reason about.</p>
</li>
<li><p><strong>Mixing Error Logic with Business Logic:</strong> Interspersing error handling frequently complicates the core functions or methods.</p>
</li>
<li><p><strong>Inconsistent Handling:</strong> Different parts of the application might handle similar errors in wildly different ways.</p>
</li>
</ol>
<p>To build scalable and maintainable applications, we need to elevate our error handling strategy.</p>
<h2 id="heading-designing-custom-exception-hierarchies-for-clarity"><strong>Designing Custom Exception Hierarchies for Clarity</strong></h2>
<p>Python's built-in exceptions (<code>ValueError</code>, <code>TypeError</code>, <code>FileNotFoundError</code>, etc.) are great, but they often lack application-specific context. Defining your own exception hierarchy provides semantic meaning and allows for more granular error handling.</p>
<p><strong>Why Create Custom Exceptions?</strong></p>
<ul>
<li><p><strong>Clarity:</strong> <code>UserServiceError</code> tells you much more than a generic <code>ValueError</code>.</p>
</li>
<li><p><strong>Targeted Handling:</strong> You can catch specific application-level errors (<code>except UserNotFoundError:</code>) separate from lower-level issues (<code>except DatabaseConnectionError:</code>).</p>
</li>
<li><p><strong>Encapsulation:</strong> Custom exceptions can carry additional context about the error (e.g., relevant IDs, failed parameters).</p>
</li>
</ul>
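<p>Targeted handling in practice looks like this. A minimal, self-contained sketch: the <code>create_user</code> service function is hypothetical, and the classes are trimmed-down stand-ins mirroring the hierarchy defined below:</p>

```python
class ApplicationError(Exception):
    """Trimmed-down base class standing in for the full hierarchy below."""
    def __init__(self, message="An application error occurred.", context=None):
        super().__init__(message)
        self.context = context or {}

class ValidationError(ApplicationError):
    def __init__(self, message="Validation failed.", field=None, context=None):
        super().__init__(message, context=context)
        if field:
            self.context["field"] = field

def create_user(payload):
    # hypothetical service-layer function raising semantic, app-level errors
    if "email" not in payload:
        raise ValidationError("Missing required field.", field="email")
    return {"id": 1, **payload}

try:
    create_user({})
except ValidationError as e:       # targeted: the caller knows exactly what failed
    outcome = f"validation error on field {e.context['field']}"
except ApplicationError:           # broad fallback for other app-specific errors
    outcome = "generic application failure"

print(outcome)  # validation error on field email
```

Ordering matters: the specific <code>ValidationError</code> handler must come before the broad <code>ApplicationError</code> one, or it will never fire.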
<p><strong>Designing the Hierarchy:</strong></p>
<p>Start with a base application exception and derive specific errors from it.</p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># --- exceptions.py ---</span>
<span class="hljs-keyword">import</span> logging

logger = logging.getLogger(__name__)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ApplicationError</span>(<span class="hljs-params">Exception</span>):</span>
    <span class="hljs-string">"""Base class for application-specific errors."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"An application error occurred."</span>, original_exception=None, context=None</span>):</span>
        super().__init__(message)
        self.original_exception = original_exception
        self.context = context <span class="hljs-keyword">or</span> {}
        <span class="hljs-comment"># Log the error creation centrally if desired (can be noisy)</span>
        <span class="hljs-comment"># logger.error(f"{self.__class__.__name__}: {message}", exc_info=original_exception)</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DatabaseError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to database operations."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"Database operation failed."</span>, original_exception=None, context=None</span>):</span>
        super().__init__(message, original_exception, context)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ValidationError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to data validation."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"Validation failed."</span>, field=None, value=None, context=None</span>):</span>
        super().__init__(message, context=context)
        self.field = field
        self.value = value
        <span class="hljs-keyword">if</span> field:
            self.context[<span class="hljs-string">'field'</span>] = field
        <span class="hljs-keyword">if</span> value <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            self.context[<span class="hljs-string">'invalid_value'</span>] = value

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AuthenticationError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to user authentication or authorization."""</span>
    <span class="hljs-keyword">pass</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ExternalServiceError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors when communicating with external services."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"External service communication failed."</span>, service_name=None, original_exception=None, context=None</span>):</span>
        super().__init__(message, original_exception, context)
        self.service_name = service_name
        <span class="hljs-keyword">if</span> service_name:
            self.context[<span class="hljs-string">'service_name'</span>] = service_name

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-comment"># In your data access layer:</span>
<span class="hljs-comment"># try:</span>
<span class="hljs-comment">#     # db_operation(...)</span>
<span class="hljs-comment"># except SomeDbLibraryError as e:</span>
<span class="hljs-comment">#     raise DatabaseError(original_exception=e, context={'query': 'SELECT...'})</span>

<span class="hljs-comment"># In your input validation logic:</span>
<span class="hljs-comment"># if not is_valid_email(email):</span>
<span class="hljs-comment">#     raise ValidationError(field='email', value=email)</span>

<span class="hljs-comment"># In higher-level code:</span>
<span class="hljs-comment"># try:</span>
<span class="hljs-comment">#     process_user_request(data)</span>
<span class="hljs-comment"># except ValidationError as ve:</span>
<span class="hljs-comment">#     return api_error_response(f"Invalid input for {ve.field}", status_code=400)</span>
<span class="hljs-comment"># except DatabaseError as de:</span>
<span class="hljs-comment">#     logger.exception("Critical database error during user request.") # Log full trace</span>
<span class="hljs-comment">#     return api_error_response("Internal server error", status_code=500)</span>
<span class="hljs-comment"># except ApplicationError as ae:</span>
<span class="hljs-comment">#     logger.warning(f"Application error: {ae}", exc_info=ae.original_exception)</span>
<span class="hljs-comment">#     return api_error_response("An unexpected error occurred.", status_code=500)</span>
</code></pre>
<p>By catching <code>ApplicationError</code>, you can handle all your custom errors, while still allowing specific handling for <code>ValidationError</code> or <code>DatabaseError</code> where needed.</p>
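<p>To see the hierarchy pay off in isolation, here is a minimal, self-contained sketch (the classes are repeated in trimmed form, and <code>bad_input</code>/<code>quota_exceeded</code> are invented stand-ins): a clause written against the base class transparently catches any subclass, while a more specific clause can still intercept first.</p>

```python
# Trimmed copies of the hierarchy, repeated so this snippet runs on its own.
class ApplicationError(Exception):
    """Base class for application-specific errors."""
    def __init__(self, message="An application error occurred.", context=None):
        super().__init__(message)
        self.message = message
        self.context = context or {}

class ValidationError(ApplicationError):
    """Errors related to data validation."""
    def __init__(self, message="Validation failed.", field=None):
        super().__init__(message)
        self.field = field

def handle(func):
    """Specific clauses first, base-class clause as the catch-all."""
    try:
        func()
        return "ok"
    except ValidationError as e:      # narrowest match wins
        return f"validation: {e.field}"
    except ApplicationError as e:     # catches every other subclass
        return f"application: {e.message}"

def bad_input():
    raise ValidationError(field="email")

def quota_exceeded():
    raise ApplicationError("Quota exceeded")

print(handle(bad_input))        # validation: email
print(handle(quota_exceeded))   # application: Quota exceeded
```

<p>Note that clause order matters: if the <code>ApplicationError</code> clause came first, it would shadow the <code>ValidationError</code> one entirely.</p>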
<h2 id="heading-pattern-the-result-object-for-explicit-error-signalling"><strong>Pattern: The Result Object for Explicit Error Signalling</strong></h2>
<p>Not all "failures" are exceptional. Sometimes, an operation is expected to fail under certain conditions (e.g., user not found, validation rule violation). Raising exceptions for these expected outcomes can be cumbersome and, when frequent, measurably costly. The <strong>Result</strong> pattern (inspired by functional programming concepts like Monads, particularly <code>Either</code> or <code>Result</code>) offers an alternative.</p>
<p>Instead of raising an exception, a function returns an object that explicitly represents either success (containing the value) or failure (containing error details).</p>
<p><strong>Benefits:</strong></p>
<ul>
<li><p><strong>Explicitness:</strong> The function signature makes it clear that it can fail in a controlled way.</p>
</li>
<li><p><strong>Clean Control Flow:</strong> Reduces the need for <code>try/except</code> blocks for <em>expected</em> failures.</p>
</li>
<li><p><strong>Type Safety (with type hints):</strong> Helps ensure callers handle both success and failure cases.</p>
</li>
</ul>
<p><strong>Simple Implementation:</strong></p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># --- result.py ---</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> TypeVar, Generic, Union, Optional, Any

T = TypeVar(<span class="hljs-string">'T'</span>) <span class="hljs-comment"># Success type</span>
E = TypeVar(<span class="hljs-string">'E'</span>) <span class="hljs-comment"># Error type</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Success</span>(<span class="hljs-params">Generic[T]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, value: T</span>):</span>
        self._value = value

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_success</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_failure</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_value</span>(<span class="hljs-params">self</span>) -&gt; T:</span>
        <span class="hljs-keyword">return</span> self._value

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_error</span>(<span class="hljs-params">self</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Cannot get error from Success"</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Failure</span>(<span class="hljs-params">Generic[E]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, error: E</span>):</span>
        self._error = error

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_success</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_failure</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_value</span>(<span class="hljs-params">self</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Cannot get value from Failure"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_error</span>(<span class="hljs-params">self</span>) -&gt; E:</span>
        <span class="hljs-keyword">return</span> self._error

Result = Union[Success[T], Failure[E]]

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Dict, Any

<span class="hljs-comment"># Assume ValidationError is defined as in the previous section</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_user_data</span>(<span class="hljs-params">data: Dict[str, Any]</span>) -&gt; Result[Dict[str, Any], ValidationError]:</span>
    email = data.get(<span class="hljs-string">'email'</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> email <span class="hljs-keyword">or</span> <span class="hljs-string">'@'</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> email:
        <span class="hljs-comment"># Return Failure for an *expected* validation issue</span>
        <span class="hljs-keyword">return</span> Failure(ValidationError(field=<span class="hljs-string">'email'</span>, value=email, message=<span class="hljs-string">"Invalid email format"</span>))

    <span class="hljs-comment"># ... other validations ...</span>

    <span class="hljs-comment"># Return Success if valid</span>
    <span class="hljs-keyword">return</span> Success(data) <span class="hljs-comment"># Or perhaps return a validated User object</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_registration</span>(<span class="hljs-params">data: Dict[str, Any]</span>):</span>
    validation_result = validate_user_data(data)

    <span class="hljs-keyword">if</span> validation_result.is_failure():
        error = validation_result.get_error()
        print(<span class="hljs-string">f"Registration failed validation: <span class="hljs-subst">{error}</span> (Field: <span class="hljs-subst">{error.field}</span>)"</span>)
        <span class="hljs-comment"># Return an appropriate response or re-raise if truly exceptional here</span>
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"error"</span>, <span class="hljs-string">"message"</span>: <span class="hljs-string">f"Validation failed: <span class="hljs-subst">{error.message}</span>"</span>}

    <span class="hljs-comment"># If success, proceed with validated data</span>
    validated_data = validation_result.get_value()
    print(<span class="hljs-string">"Validation successful, proceeding with registration..."</span>)
    <span class="hljs-comment"># ... create user in database (this might raise a DatabaseError exception) ...</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"success"</span>, <span class="hljs-string">"user_id"</span>: <span class="hljs-number">123</span>}

<span class="hljs-comment"># Calling the function</span>
user_data_invalid = {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Test"</span>}
process_registration(user_data_invalid)

user_data_valid = {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Test"</span>, <span class="hljs-string">"email"</span>: <span class="hljs-string">"test@example.com"</span>}
process_registration(user_data_valid)
</code></pre>
<p>Libraries like <code>returns</code> or <code>result</code> provide more sophisticated implementations of this pattern. Use Results for predictable failure paths and Exceptions for truly unexpected or system-level errors.</p>
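<p>One convenience the minimal implementation above leaves out is chaining: running several fallible steps without nested <code>if</code> checks between them. A small <code>and_then</code> helper (a hand-rolled sketch, not an API of any library; <code>Success</code>/<code>Failure</code> are repeated in simplified form so the snippet is self-contained) short-circuits on the first <code>Failure</code>:</p>

```python
from typing import Generic, TypeVar, Union

T = TypeVar('T')
E = TypeVar('E')

class Success(Generic[T]):
    def __init__(self, value: T):
        self.value = value
    def is_success(self) -> bool:
        return True

class Failure(Generic[E]):
    def __init__(self, error: E):
        self.error = error
    def is_success(self) -> bool:
        return False

Result = Union[Success[T], Failure[E]]

def and_then(result, fn):
    """Feed a Success value into the next step; pass a Failure through untouched."""
    return fn(result.value) if result.is_success() else result

def parse_int(s: str):
    try:
        return Success(int(s))
    except ValueError:
        return Failure(f"not an integer: {s!r}")

def check_positive(n: int):
    return Success(n) if n > 0 else Failure(f"not positive: {n}")

r1 = and_then(parse_int("42"), check_positive)   # Success(42)
r2 = and_then(parse_int("-3"), check_positive)   # Failure: not positive
r3 = and_then(parse_int("abc"), check_positive)  # Failure: parse step already failed
```

<p>Because <code>and_then</code> returns the <code>Failure</code> unchanged, later steps never run once one step fails — the same control flow that early returns give the exception style, expressed as data.</p>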
<h2 id="heading-centralized-vs-localized-error-handling-making-the-right-choice"><strong>Centralized vs. Localized Error Handling: Making the Right Choice</strong></h2>
<p>Where should you handle errors?</p>
<ul>
<li><p><strong>Localized Handling:</strong> Catch and handle errors immediately where they occur.</p>
<ul>
<li><p><strong>Pros:</strong> Simple for straightforward cases, keeps handling logic close to the source.</p>
</li>
<li><p><strong>Cons:</strong> Can lead to repetition, mixes error logic with business logic, difficult to enforce consistency.</p>
</li>
<li><p><strong>Best for:</strong> Recoverable errors where the immediate context has all the information needed to proceed or compensate (e.g., retrying a network call, falling back to a default value).</p>
</li>
</ul>
</li>
<li><p><strong>Centralized Handling:</strong> Allow exceptions to propagate up the call stack and handle them at specific boundaries (e.g., API endpoint decorators, middleware, main application loop).</p>
<ul>
<li><p><strong>Pros:</strong> Enforces consistency (logging, error responses), separates concerns, simplifies core business logic.</p>
</li>
<li><p><strong>Cons:</strong> Can sometimes lose specific context if not propagated correctly (custom exceptions help here!), might require more setup (e.g., framework middleware).</p>
</li>
<li><p><strong>Best for:</strong> Handling errors that terminate a request/response cycle (web apps), performing consistent logging/alerting, converting exceptions to user-friendly messages or standard error formats (e.g., JSON API error responses).</p>
</li>
</ul>
</li>
</ul>
<p><strong>Example (Web Framework Middleware/Decorator):</strong></p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Using Flask as an example (similar concepts apply to Django, FastAPI, etc.)</span>
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, jsonify, request
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> ApplicationError, ValidationError, DatabaseError <span class="hljs-comment"># Our custom exceptions</span>
<span class="hljs-keyword">import</span> logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

<span class="hljs-meta">@app.errorhandler(ValidationError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_validation_error</span>(<span class="hljs-params">error: ValidationError</span>):</span>
    logging.warning(<span class="hljs-string">f"Validation failed: <span class="hljs-subst">{error.message}</span> for field '<span class="hljs-subst">{error.field}</span>'"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"VALIDATION_ERROR"</span>,
        <span class="hljs-string">"message"</span>: error.message,
        <span class="hljs-string">"details"</span>: error.context
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">400</span> <span class="hljs-comment"># Bad Request</span>

<span class="hljs-meta">@app.errorhandler(DatabaseError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_database_error</span>(<span class="hljs-params">error: DatabaseError</span>):</span>
    <span class="hljs-comment"># Log the full exception trace for internal errors</span>
    logging.exception(<span class="hljs-string">f"Database error occurred processing <span class="hljs-subst">{request.path}</span>"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"INTERNAL_SERVER_ERROR"</span>,
        <span class="hljs-string">"message"</span>: <span class="hljs-string">"A database error occurred. Please try again later."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span> <span class="hljs-comment"># Internal Server Error</span>

<span class="hljs-meta">@app.errorhandler(ApplicationError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_application_error</span>(<span class="hljs-params">error: ApplicationError</span>):</span>
    <span class="hljs-comment"># Catch-all for other specific application errors</span>
    logging.error(<span class="hljs-string">f"Application error occurred: <span class="hljs-subst">{error}</span>"</span>, exc_info=error.original_exception)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"APPLICATION_ERROR"</span>,
        <span class="hljs-string">"message"</span>: str(error) <span class="hljs-keyword">or</span> <span class="hljs-string">"An unexpected application error occurred."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span> <span class="hljs-comment"># Internal Server Error</span>

<span class="hljs-meta">@app.errorhandler(Exception)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_generic_exception</span>(<span class="hljs-params">error: Exception</span>):</span>
    <span class="hljs-comment"># Catch any unexpected Python errors not caught by specific handlers</span>
    logging.exception(<span class="hljs-string">f"Unhandled exception occurred processing <span class="hljs-subst">{request.path}</span>"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"UNEXPECTED_ERROR"</span>,
        <span class="hljs-string">"message"</span>: <span class="hljs-string">"An unexpected server error occurred."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span>

<span class="hljs-meta">@app.route('/users', methods=['POST'])</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_user</span>():</span>
    <span class="hljs-comment"># ... get data from request ...</span>
    data = request.json
    <span class="hljs-comment"># Assume process_registration raises ValidationError or DatabaseError on failure</span>
    <span class="hljs-comment"># It doesn't need try/except blocks internally for these specific errors anymore</span>
    <span class="hljs-comment"># because the error handlers above will catch them.</span>
    <span class="hljs-comment"># result = process_registration(data) # Function might still raise exceptions</span>
    <span class="hljs-comment"># Mocking potential errors for demonstration</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> data.get(<span class="hljs-string">'email'</span>):
         <span class="hljs-keyword">raise</span> ValidationError(field=<span class="hljs-string">'email'</span>, message=<span class="hljs-string">'Email is required'</span>)
    <span class="hljs-keyword">if</span> data.get(<span class="hljs-string">'trigger_db_error'</span>):
         <span class="hljs-keyword">raise</span> DatabaseError(<span class="hljs-string">"Failed to connect to user DB"</span>)

    <span class="hljs-keyword">return</span> jsonify({<span class="hljs-string">"status"</span>: <span class="hljs-string">"success"</span>, <span class="hljs-string">"user_id"</span>: <span class="hljs-number">42</span>}), <span class="hljs-number">201</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    app.run(debug=<span class="hljs-literal">True</span>) <span class="hljs-comment"># Debug mode shows interactive traceback; test handlers with debug=False</span>
</code></pre>
<p>Often, a combination is best: handle specific, recoverable errors locally, and let broader or request-terminating errors propagate to a centralized handler.</p>
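<p>A concrete shape for that combination: handle a transient failure locally with retries, and only when recovery fails wrap the original exception and let it propagate to the centralized handler. This is a sketch — the retry count, <code>flaky_fetch</code>, and the use of <code>ExternalServiceError</code> are illustrative, and the class is repeated in trimmed form so the snippet runs on its own:</p>

```python
import time

class ApplicationError(Exception):
    """Trimmed copy of the base class from earlier."""
    def __init__(self, message, original_exception=None):
        super().__init__(message)
        self.original_exception = original_exception

class ExternalServiceError(ApplicationError):
    """Errors when communicating with external services."""

def fetch_with_retry(fetch, attempts=3, delay=0.0):
    """Localized handling: retry transient ConnectionErrors in place.
    Once retries are exhausted, escalate for centralized handling."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fetch()
        except ConnectionError as e:  # recoverable here: just try again
            last_exc = e
            time.sleep(delay)
    # Recovery failed: wrap the cause and propagate upward
    raise ExternalServiceError("Upstream service unavailable",
                               original_exception=last_exc)

# Demo: a fetch that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return {"status": "ok"}

print(fetch_with_retry(flaky_fetch))  # recovered locally; the caller never sees the timeouts
```

<p>The caller's code stays clean: transient blips are absorbed where the retry knowledge lives, while a genuinely dead upstream becomes a typed exception the centralized handler already knows how to log and report.</p>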
<h2 id="heading-effective-logging-strategies-tied-to-exceptions"><strong>Effective Logging Strategies Tied to Exceptions</strong></h2>
<p>Logging is crucial for understanding errors after they occur. Integrate it tightly with your exception handling strategy.</p>
<ul>
<li><p><strong>Log at the Right Level:</strong></p>
<ul>
<li><p><code>logging.ERROR</code> or <code>logging.CRITICAL</code>: For unhandled exceptions or serious failures caught in centralized handlers. Include stack traces (<code>exc_info=True</code>).</p>
</li>
<li><p><code>logging.WARNING</code>: For handled exceptions that indicate a potential problem or an expected but notable failure (e.g., validation errors, external service timeouts if handled gracefully).</p>
</li>
<li><p><code>logging.INFO</code>: For significant application lifecycle events, not typically for errors themselves.</p>
</li>
<li><p><code>logging.DEBUG</code>: For detailed information useful only during development/debugging.</p>
</li>
</ul>
</li>
<li><p><strong>Include Context:</strong> Log relevant information like user IDs, request IDs, input parameters (be careful with sensitive data!), and custom exception attributes. Centralized handlers and custom exception <code>context</code> attributes are great for this.</p>
</li>
<li><p><strong>Use Structured Logging:</strong> Log messages in formats like JSON. This makes logs much easier to parse, filter, and analyze with log aggregation tools (e.g., ELK stack, Datadog, Splunk).</p>
</li>
</ul>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Configure structured logging (basic example)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JsonFormatter</span>(<span class="hljs-params">logging.Formatter</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">format</span>(<span class="hljs-params">self, record</span>):</span>
        log_record = {
            <span class="hljs-string">"timestamp"</span>: self.formatTime(record, self.datefmt),
            <span class="hljs-string">"level"</span>: record.levelname,
            <span class="hljs-string">"message"</span>: record.getMessage(),
            <span class="hljs-string">"logger_name"</span>: record.name,
        }
        <span class="hljs-keyword">if</span> record.exc_info:
            <span class="hljs-comment"># Add traceback if available</span>
            log_record[<span class="hljs-string">'exception'</span>] = self.formatException(record.exc_info)
        <span class="hljs-keyword">if</span> hasattr(record, <span class="hljs-string">'context'</span>): <span class="hljs-comment"># Add custom context if provided</span>
             log_record.update(record.context)
        <span class="hljs-keyword">return</span> json.dumps(log_record)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger(__name__)

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> DatabaseError <span class="hljs-comment"># Our custom exception</span>

request_id = <span class="hljs-string">"req-123abc"</span>
user_id = <span class="hljs-number">42</span>

<span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Simulate a database operation failing</span>
    <span class="hljs-keyword">raise</span> ConnectionError(<span class="hljs-string">"DB connection timeout"</span>)
<span class="hljs-keyword">except</span> ConnectionError <span class="hljs-keyword">as</span> e:
    <span class="hljs-comment"># Wrap the original exception and add context</span>
    db_error = DatabaseError(
        original_exception=e,
        context={<span class="hljs-string">'query_details'</span>: <span class="hljs-string">'UPDATE users SET ...'</span>, <span class="hljs-string">'user_id'</span>: user_id, <span class="hljs-string">'request_id'</span>: request_id}
    )
    <span class="hljs-comment"># Log with ERROR level, exc_info, and custom context</span>
    logger.error(
        <span class="hljs-string">f"Database operation failed for user <span class="hljs-subst">{user_id}</span>"</span>,
        exc_info=db_error,
        extra={<span class="hljs-string">'context'</span>: db_error.context} <span class="hljs-comment"># Pass context to logger</span>
    )
    <span class="hljs-comment"># Re-raise or handle as appropriate</span>
    <span class="hljs-comment"># raise db_error</span>
</code></pre>
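<p>For request-scoped context there is also a lighter stdlib option: <code>logging.LoggerAdapter</code> wraps a logger with a fixed <code>extra</code> dict, so every line emitted through it carries the same fields without repeating them at each call site. A minimal sketch (the request id and logger name are made up; note the format string assumes every record passing through this handler has a <code>request_id</code>):</p>

```python
import io
import logging

# Capture output in a string so the result is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))

logger = logging.getLogger("app.request")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One adapter per request: the id rides along on every record automatically
log = logging.LoggerAdapter(logger, {"request_id": "req-123abc"})
log.warning("validation failed for field 'email'")

print(stream.getvalue().strip())  # WARNING req-123abc validation failed for field 'email'
```

<p>In a web app you would typically build the adapter once per request (e.g., in middleware) and hand it to downstream code in place of the bare logger.</p>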
<h2 id="heading-testing-your-error-paths-ensuring-resilience"><strong>Testing Your Error Paths: Ensuring Resilience</strong></h2>
<p>Your error handling code is code too, and it needs testing!</p>
<ul>
<li><p><strong>Test Exception Raising:</strong> Use <code>pytest.raises</code> to assert that specific functions raise the expected custom exceptions under failure conditions.</p>
</li>
<li><p><strong>Test Exception Handling:</strong> In integration or end-to-end tests, simulate failure conditions (e.g., mock a database call to raise an error) and verify that your centralized handlers produce the correct output (e.g., the right HTTP status code and error JSON).</p>
</li>
<li><p><strong>Test Result Objects:</strong> If using the Result pattern, write tests that check both the <code>Success</code> and <code>Failure</code> return paths, verifying the contained value or error object.</p>
</li>
</ul>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pytest
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> ValidationError, DatabaseError
<span class="hljs-keyword">from</span> result <span class="hljs-keyword">import</span> Success, Failure, Result <span class="hljs-comment"># Assuming result.py from earlier</span>

<span class="hljs-comment"># --- Functions to test (simplified examples) ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_email</span>(<span class="hljs-params">email: str | None</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> email <span class="hljs-keyword">or</span> <span class="hljs-string">'@'</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> email:
        <span class="hljs-keyword">raise</span> ValidationError(field=<span class="hljs-string">'email'</span>, value=email)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">might_fail_db</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-comment"># Simulate a potential failure</span>
    <span class="hljs-keyword">raise</span> DatabaseError(<span class="hljs-string">"Connection failed"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data_result</span>(<span class="hljs-params">data: dict</span>) -&gt; Result[str, str]:</span>
     <span class="hljs-keyword">if</span> data.get(<span class="hljs-string">"valid"</span>):
         <span class="hljs-keyword">return</span> Success(<span class="hljs-string">"Processed successfully"</span>)
     <span class="hljs-keyword">else</span>:
         <span class="hljs-keyword">return</span> Failure(<span class="hljs-string">"Invalid data provided"</span>)

<span class="hljs-comment"># --- Tests using pytest ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_validate_email_raises_validation_error</span>():</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValidationError) <span class="hljs-keyword">as</span> excinfo:
        validate_email(<span class="hljs-string">"invalid-email"</span>)
    <span class="hljs-keyword">assert</span> excinfo.value.field == <span class="hljs-string">'email'</span>
    <span class="hljs-keyword">assert</span> excinfo.value.value == <span class="hljs-string">"invalid-email"</span>

    <span class="hljs-keyword">with</span> pytest.raises(ValidationError):
        validate_email(<span class="hljs-literal">None</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_validate_email_success</span>():</span>
    <span class="hljs-keyword">try</span>:
        validate_email(<span class="hljs-string">"test@example.com"</span>)
    <span class="hljs-keyword">except</span> ValidationError:
        pytest.fail(<span class="hljs-string">"ValidationError raised unexpectedly"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_db_function_raises_database_error</span>():</span>
    <span class="hljs-keyword">with</span> pytest.raises(DatabaseError):
        might_fail_db()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_process_data_result_success</span>():</span>
    result = process_data_result({<span class="hljs-string">"valid"</span>: <span class="hljs-literal">True</span>})
    <span class="hljs-keyword">assert</span> result.is_success()
    <span class="hljs-keyword">assert</span> <span class="hljs-keyword">not</span> result.is_failure()
    <span class="hljs-keyword">assert</span> result.get_value() == <span class="hljs-string">"Processed successfully"</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValueError): <span class="hljs-comment"># Cannot get error from Success</span>
        result.get_error()


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_process_data_result_failure</span>():</span>
    result = process_data_result({<span class="hljs-string">"valid"</span>: <span class="hljs-literal">False</span>})
    <span class="hljs-keyword">assert</span> <span class="hljs-keyword">not</span> result.is_success()
    <span class="hljs-keyword">assert</span> result.is_failure()
    <span class="hljs-keyword">assert</span> result.get_error() == <span class="hljs-string">"Invalid data provided"</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValueError): <span class="hljs-comment"># Cannot get value from Failure</span>
        result.get_value()

<span class="hljs-comment"># For testing centralized handlers (e.g., Flask app):</span>
<span class="hljs-comment"># Use the test client provided by the framework</span>
<span class="hljs-comment"># def test_api_validation_error(client): # client is a pytest fixture for the Flask test client</span>
<span class="hljs-comment">#     response = client.post('/users', json={'name': 'Test'}) # Missing email</span>
<span class="hljs-comment">#     assert response.status_code == 400</span>
<span class="hljs-comment">#     assert response.json['error'] == 'VALIDATION_ERROR'</span>
<span class="hljs-comment">#     assert 'email' in response.json['details'].get('field', '')</span>
</code></pre>
<h2 id="heading-conclusion-error-handling-as-a-core-architectural-concern"><strong>Conclusion: Error Handling as a Core Architectural Concern</strong></h2>
<p>Robust error handling is not an afterthought; it's a cornerstone of reliable, maintainable software. By moving beyond basic <code>try/except</code> blocks and adopting more structured approaches, we can significantly improve our Python applications.</p>
<ul>
<li><p><strong>Design custom exception hierarchies</strong> for semantic clarity and targeted handling.</p>
</li>
<li><p><strong>Consider the Result pattern</strong> for explicit signaling of expected, non-exceptional failures.</p>
</li>
<li><p><strong>Strategically choose between localized and centralized handling</strong> to balance simplicity and consistency.</p>
</li>
<li><p><strong>Integrate logging deeply</strong> with your error handling, providing context and structure.</p>
</li>
<li><p><strong>Rigorously test your error paths</strong> to ensure your safety nets actually work.</p>
</li>
</ul>
<p>By incorporating these techniques, you shift from simply reacting to errors to proactively architecting for resilience. This investment pays dividends in easier debugging, more stable applications, and a more pleasant development experience for you and your team.</p>
]]></content:encoded></item><item><title><![CDATA[Demystifying Reinforcement Learning: A Beginner's Guide to the Math]]></title><description><![CDATA[Introduction
Imagine teaching a computer to play chess from scratch. How would it learn which moves lead to checkmate and which lead to defeat? How would it understand the long-term consequences of capturing a pawn versus protecting its queen? This i...]]></description><link>https://blogs.yashpatel.xyz/demystifying-reinforcement-learning-a-beginners-guide-to-the-math</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/demystifying-reinforcement-learning-a-beginners-guide-to-the-math</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Tue, 04 Mar 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742818943874/9704e3e3-e8db-4c70-96c2-e61f0c17982d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Imagine teaching a computer to play chess from scratch. How would it learn which moves lead to checkmate and which lead to defeat? How would it understand the long-term consequences of capturing a pawn versus protecting its queen? This is the essence of reinforcement learning (RL), where agents learn optimal decision-making through interaction with their environment.</p>
<p>In this comprehensive blog, we'll explore Chapter 3 of Sutton and Barto's "Reinforcement Learning: An Introduction," which lays the mathematical foundation for understanding how RL agents learn. Whether you're a computer science student or an AI enthusiast, this guide will help you grasp these concepts through detailed explanations and our running chess analogy.</p>
<h2 id="heading-1-the-agent-environment-interface">1. The Agent-Environment Interface</h2>
<p>At the heart of reinforcement learning is the interaction between an agent (our chess-playing AI) and its environment (the chess game). This interaction happens in discrete time steps and follows a specific pattern:</p>
<ol>
<li><p>The agent observes the current state (S₁) - in chess, this is the current board position</p>
</li>
<li><p>Based on this state, the agent selects an action (A₁) - a specific chess move</p>
</li>
<li><p>The environment responds with:</p>
<ul>
<li><p>A new state (S₂) - the new board position after both players move</p>
</li>
<li><p>A reward (R₂) - perhaps +1 for capturing a piece, -1 for losing one, or +100 for checkmate</p>
</li>
</ul>
</li>
</ol>
<p>This cycle continues, creating a sequence: S₁, A₁, R₂, S₂, A₂, R₃, S₃, ... and so on.</p>
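<p>The interaction loop is easy to sketch in code. The environment below is a stand-in, not a chess engine: its states, transitions, and rewards are invented so that the S, A, R bookkeeping is visible.</p>
<pre><code class="lang-python">import random

class ToyEnv:
    """Stand-in environment: 3 placeholder states, terminates after 5 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial state S1

    def step(self, action):
        self.t += 1
        next_state = (action + self.t) % 3    # placeholder transition rule
        reward = 1 if next_state == 2 else 0  # placeholder reward signal
        done = self.t == 5
        return next_state, reward, done

env = ToyEnv()
state = env.reset()
trajectory = []  # records S1, A1, R2, S2, A2, R3, ...
done = False
while not done:
    action = random.choice([0, 1])            # a random policy, for illustration
    next_state, reward, done = env.step(action)
    trajectory.extend([state, action, reward])
    state = next_state

print(trajectory)
</code></pre>
<p>Every algorithm discussed later consumes exactly this kind of trajectory; only the rule for choosing actions changes.</p>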
<h3 id="heading-the-mathematical-framework-markov-decision-processes-mdps">The Mathematical Framework: Markov Decision Processes (MDPs)</h3>
<p>This interaction is formalized as a Markov Decision Process (MDP), which has five key components:</p>
<ol>
<li><p>A set of states (S) - all possible chess board configurations</p>
</li>
<li><p>A set of actions (A) - all legal chess moves</p>
</li>
<li><p>A set of rewards (R) - numerical feedback values</p>
</li>
<li><p>State transition probabilities - how likely each new state is, given the current state and action</p>
</li>
<li><p>Reward probabilities - how likely each reward is, given the state and action</p>
</li>
</ol>
<p>In a finite MDP, these sets contain a finite number of elements, making the problem mathematically tractable.</p>
<h3 id="heading-the-dynamics-function">The Dynamics Function</h3>
<p>The environment's behaviour is completely described by the dynamics function:</p>
<p>$$p(s', r | s, a) = \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$</p><p>This function gives the probability of transitioning to state s' and receiving reward r, given that we were in state s and took action a.</p>
<p>Let's break this down with our chess example:</p>
<ul>
<li><p>s: Current board position with your knight threatened by an opponent's pawn</p>
</li>
<li><p>a: Move your knight to safety</p>
</li>
<li><p>s': New board position after your move and your opponent's response</p>
</li>
<li><p>r: Reward (perhaps +0.5 for saving your piece)</p>
</li>
</ul>
<p>The dynamics function tells us the probability of ending up in state s' with reward r after taking action a in state s.</p>
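<p>To make this concrete, here is a toy dynamics function written as a Python dictionary (the states, rewards, and probabilities are invented for illustration). Marginalizing out the reward recovers the state-transition probabilities p(s'|s, a), and averaging rewards by their probabilities gives the expected reward r(s, a), just as in the derived quantities discussed below.</p>
<pre><code class="lang-python"># Toy dynamics: p[(s, a)] is a list of (next_state, reward, probability).
# States, actions, and numbers are invented purely for illustration.
p = {
    ("safe", "advance"): [("winning", 0.5, 0.7), ("losing", -0.5, 0.3)],
    ("safe", "retreat"): [("safe", 0.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): sum of p(s', r | s, a) over all rewards r."""
    return sum(prob for ns, r, prob in p[(s, a)] if ns == s_next)

def expected_reward(s, a):
    """r(s,a): sum of r * p(s', r | s, a) over all s' and r."""
    return sum(r * prob for ns, r, prob in p[(s, a)])

print(transition_prob("safe", "advance", "winning"))  # 0.7
print(expected_reward("safe", "advance"))             # 0.5*0.7 - 0.5*0.3 ≈ 0.2
</code></pre>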
<h3 id="heading-the-markov-property">The Markov Property</h3>
<p>A critical aspect of MDPs is the Markov property, which states that the future depends only on the present state and action, not on the history of how we got there. In chess terms, the best move depends only on the current board position, not on the sequence of moves that created it.</p>
<p>This might seem counterintuitive at first - doesn't chess strategy depend on understanding your opponent's past moves? The key insight is that if the "state" is properly defined to include all relevant information (perhaps including a model of your opponent's strategy based on their past moves), then the Markov property holds.</p>
<h3 id="heading-derived-quantities-from-the-dynamics-function">Derived Quantities from the Dynamics Function</h3>
<p>From the dynamics function, we can derive other useful quantities:</p>
<ol>
<li>State-transition probabilities:</li>
</ol>
<p>$$p(s' | s, a) = \sum_r p(s', r | s, a)$$</p><ol start="2">
<li>Expected rewards for state-action pairs:</li>
</ol>
<p>$$r(s, a) = \sum_r r \sum_{s'} p(s', r | s, a)$$</p><ol start="3">
<li>Expected rewards for state-action-next-state triples:</li>
</ol>
<p>$$r(s, a, s') = \sum_r r \left[ \frac{p(s', r | s, a)}{p(s' | s, a)} \right]$$</p><h3 id="heading-the-agent-environment-boundary">The Agent-Environment Boundary</h3>
<p>An important conceptual point is where to draw the line between agent and environment. In our chess example, the agent is the decision-making algorithm, while the environment includes the chess board, pieces, rules, and opponent.</p>
<p>The general principle is: anything the agent cannot arbitrarily change is part of the environment. The agent's knowledge about the environment is separate from this boundary - an agent might know everything about chess rules but still face a challenging learning problem.</p>
<h2 id="heading-2-goals-and-rewards">2. Goals and Rewards</h2>
<p>In reinforcement learning, the agent's goal is formalized through rewards - numerical values that the agent receives after each action. The fundamental principle is the reward hypothesis:</p>
<blockquote>
<p>All goals can be described as the maximization of expected cumulative reward.</p>
</blockquote>
<h3 id="heading-designing-reward-signals">Designing Reward Signals</h3>
<p>Designing an effective reward signal is crucial but tricky. For our chess AI:</p>
<ul>
<li><p>Too sparse: Only +1 for winning, -1 for losing, 0 otherwise. This makes learning slow because feedback is delayed until the end of a long game.</p>
</li>
<li><p>Too frequent/misleading: +1 for each piece captured might encourage the AI to sacrifice important pieces to capture pawns.</p>
</li>
<li><p>Well-designed: Perhaps +0.1 for controlling center squares, +0.5 for checking the opponent, +1 for capturing pieces (weighted by value), and +100 for checkmate.</p>
</li>
</ul>
<p>The key principle is that rewards communicate WHAT to achieve, not HOW to achieve it. If we reward subgoals too strongly, the agent might optimize for those at the expense of the true goal.</p>
<p>For example, if we heavily reward capturing pieces, our chess AI might sacrifice positional advantage just to capture a pawn. Instead, the reward should reflect the true objective - winning the game - with smaller rewards for actions that generally contribute to that goal.</p>
<h3 id="heading-examples-of-reward-design">Examples of Reward Design</h3>
<ol>
<li><p><strong>Chess AI</strong>: +100 for checkmate, -100 for being checkmated, +piece_value for captures, -piece_value for losses, small rewards for controlling important squares.</p>
</li>
<li><p><strong>Robot Navigation</strong>: -1 for each time step (to encourage efficiency), -10 for collisions, +100 for reaching the goal.</p>
</li>
<li><p><strong>Stock Trading Agent</strong>: Reward proportional to portfolio value increase, with perhaps small penalties for excessive trading (to discourage churn).</p>
</li>
</ol>
<p>The art of reward design involves balancing immediate feedback (to guide learning) with alignment to the true objective (to ensure the right behavior is learned).</p>
<h2 id="heading-3-returns-and-episodes">3. Returns and Episodes</h2>
<p>Now that we've defined states, actions, and rewards, we need to formalize the agent's objective: maximizing the expected return. The return is the function of future rewards that the agent aims to maximize.</p>
<h3 id="heading-episodic-tasks">Episodic Tasks</h3>
<p>In episodic tasks, the agent-environment interaction naturally breaks into distinct episodes with clear endpoints. A chess game is a perfect example - each game starts from the initial board position and ends with checkmate, stalemate, or resignation.</p>
<p>For episodic tasks, we define the return as the sum of rewards from the current time step until the end of the episode:</p>
<p>$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$</p><p>Where T is the final time step of the episode.</p>
<p>For our chess AI, this would be the sum of all rewards received from the current position until the game ends.</p>
<h3 id="heading-continuing-tasks">Continuing Tasks</h3>
<p>In continuing tasks, the interaction continues without a natural endpoint. Examples include ongoing process control or a robot that operates continuously.</p>
<p>For these tasks, summing rewards could lead to infinite returns, making it impossible to compare policies. To solve this, we introduce discounting:</p>
<p>$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$</p><p>Where γ is the discount rate (0 ≤ γ ≤ 1).</p>
<h3 id="heading-the-discount-rate">The Discount Rate</h3>
<p>The discount rate γ determines how much the agent values future rewards:</p>
<ul>
<li><p>γ = 0: "Myopic" agent that only cares about immediate rewards</p>
</li>
<li><p>γ close to 0: Agent values near-term rewards much more than long-term rewards</p>
</li>
<li><p>γ close to 1: Agent values future rewards almost as much as immediate ones</p>
</li>
</ul>
<p>In chess, a low γ might lead to an agent that greedily captures pieces without considering long-term positional disadvantages. A high γ would encourage strategic play, where the agent might sacrifice material for long-term advantage.</p>
<h3 id="heading-recursive-relationship-of-returns">Recursive Relationship of Returns</h3>
<p>Returns at successive time steps are related recursively:</p>
<p>$$G_t = R_{t+1} + \gamma G_{t+1}$$</p><p>This recursive relationship is fundamental to many RL algorithms and helps us understand how value propagates backward from future states to current states.</p>
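<p>Both formulas are easy to verify numerically. The sketch below computes G₀ for an invented five-step reward sequence directly from the discounted sum, and again by folding the rewards backward through the recursion; the two results agree.</p>
<pre><code class="lang-python">gamma = 0.9
rewards = [1, 0, -1, 2, 5]  # R1..R5 for an invented five-step episode

# Direct definition: G_0 = sum over k of gamma^k * R_{k+1}
g_direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g_direct, g)  # both ≈ 4.9285
</code></pre>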
<h2 id="heading-4-unified-notation-for-episodic-and-continuing-tasks">4. Unified Notation for Episodic and Continuing Tasks</h2>
<p>To handle both episodic and continuing tasks with a single notation, we use a clever mathematical trick: we treat episode termination as entering a special absorbing state that transitions only to itself and generates only rewards of zero.</p>
<p>This allows us to use the discounted return formula for both types of tasks:</p>
<p>$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$</p><p>For episodic tasks, we can either:</p>
<ol>
<li><p>Set γ = 1 and rely on the episode's finite length</p>
</li>
<li><p>Use γ &lt; 1 and treat the terminal state as absorbing with zero rewards</p>
</li>
</ol>
<p>This unified notation simplifies our mathematical framework and allows us to develop algorithms that work for both types of tasks.</p>
<h2 id="heading-5-policies-and-value-functions">5. Policies and Value Functions</h2>
<h3 id="heading-policies">Policies</h3>
<p>A policy (π) defines the agent's behavior - it's the strategy that maps states to actions. In reinforcement learning, we typically use stochastic policies, where π(a|s) gives the probability of taking action a in state s.</p>
<p>For our chess AI, the policy would specify the probability of making each possible move in any given board position. A deterministic policy would always choose the same move in the same position, while a stochastic policy might sometimes explore different options.</p>
<h3 id="heading-value-functions">Value Functions</h3>
<p>Value functions estimate how good it is for the agent to be in a given state (or to take a given action in a given state). They're defined in terms of expected future returns.</p>
<h3 id="heading-state-value-function">State-Value Function</h3>
<p>The state-value function for policy π is defined as:</p>
<p>$$v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s\right]$$</p><p>This gives the expected return when starting in state s and following policy π thereafter.</p>
<p>In chess terms, vπ(s) would tell us the expected outcome (in terms of our reward metric) when starting from board position s and playing according to strategy π.</p>
<h3 id="heading-action-value-function">Action-Value Function</h3>
<p>The action-value function for policy π is defined as:</p>
<p>$$q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s, A_t = a\right]$$</p><p>This gives the expected return when starting in state s, taking action a, and thereafter following policy π.</p>
<p>In chess, qπ(s, a) would tell us the expected outcome when making a specific move a from position s, and then continuing with strategy π.</p>
<h3 id="heading-the-bellman-equation">The Bellman Equation</h3>
<p>A fundamental property of value functions is that they satisfy recursive relationships known as Bellman equations. For the state-value function:</p>
<p>$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$</p><p>This equation expresses a relationship between the value of a state and the values of its successor states. It's saying that the value of the current state equals the expected immediate reward plus the discounted value of the next state.</p>
<p>In chess terms, the value of a position equals the immediate benefit of your next move (capturing a piece, controlling the center, etc.) plus the discounted value of the resulting position.</p>
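<p>The Bellman equation also doubles as an algorithm: start with arbitrary values and repeatedly apply its right-hand side as an update (iterative policy evaluation). Here is a minimal sketch on an invented two-state MDP with a uniform random policy; all states, rewards, and probabilities are made up for illustration.</p>
<pre><code class="lang-python">gamma = 0.9
states = ["A", "B"]
actions = ["x", "y"]
# dynamics[(s, a)] = list of (next_state, reward, probability); invented numbers
dynamics = {
    ("A", "x"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "y"): [("B", 0.0, 1.0)],
    ("B", "x"): [("A", 2.0, 1.0)],
    ("B", "y"): [("B", -1.0, 1.0)],
}
pi = 0.5  # uniform random policy: pi(a|s) = 0.5 for both actions

v = {s: 0.0 for s in states}
for _ in range(500):  # enough sweeps for the contraction to converge
    v = {
        s: sum(
            pi * sum(prob * (r + gamma * v[ns]) for ns, r, prob in dynamics[(s, a)])
            for a in actions
        )
        for s in states
    }

print(v)  # v_pi for each state under the random policy
</code></pre>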
<h3 id="heading-backup-diagrams">Backup Diagrams</h3>
<p>Backup diagrams visually represent these recursive relationships. They show how value information "backs up" from future states to the current state.</p>
<p>For the state-value function, the backup diagram shows:</p>
<ol>
<li><p>The current state at the top</p>
</li>
<li><p>Actions available from that state</p>
</li>
<li><p>Possible next states resulting from each action</p>
</li>
<li><p>The value backing up from those next states to the current state</p>
</li>
</ol>
<p>These diagrams help visualise the flow of value information in reinforcement learning algorithms.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742819292558/4961a13d-02f1-43e8-862d-5ef49ff7c93b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-6-optimal-policies-and-optimal-value-functions">6. Optimal Policies and Optimal Value Functions</h2>
<p>The goal in reinforcement learning is to find an optimal policy - one that achieves the maximum expected return from all states. This leads us to define optimal value functions.</p>
<h3 id="heading-optimal-state-value-function">Optimal State-Value Function</h3>
<p>The optimal state-value function, v*(s), gives the maximum value achievable by any policy for each state s:</p>
<p>$$v_*(s) = \max_\pi v_\pi(s)$$</p><p>In chess, v*(s) would tell us the expected outcome when playing optimally from position s.</p>
<h3 id="heading-optimal-action-value-function">Optimal Action-Value Function</h3>
<p>Similarly, the optimal action-value function, q*(s, a), gives the maximum expected return for each state-action pair:</p>
<p>$$q_*(s, a) = \max_\pi q_\pi(s, a)$$</p><p>In chess, q*(s, a) would tell us the expected outcome when making move a from position s and then playing optimally thereafter.</p>
<h3 id="heading-bellman-optimality-equations">Bellman Optimality Equations</h3>
<p>The optimal value functions satisfy special Bellman equations called Bellman optimality equations:</p>
<p>$$v_*(s) = \max_a \sum_{s', r} p(s', r|s, a)[r + \gamma v_*(s')]$$</p><p>$$q_*(s, a) = \sum_{s', r} p(s', r|s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$</p><p>These equations express the principle that the value of a state under an optimal policy equals the expected return for the best action from that state.</p>
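<p>Replacing the expectation over a policy with a max over actions turns the Bellman optimality equation into an update rule, value iteration. Below is a minimal sketch on an invented two-state MDP (all numbers are made up for illustration), including greedy policy extraction at the end.</p>
<pre><code class="lang-python">gamma = 0.9
states = ["A", "B"]
actions = ["x", "y"]
# dynamics[(s, a)] = list of (next_state, reward, probability); invented numbers
dynamics = {
    ("A", "x"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "y"): [("B", 0.0, 1.0)],
    ("B", "x"): [("A", 2.0, 1.0)],
    ("B", "y"): [("B", -1.0, 1.0)],
}

def backup(s, a, v):
    """Expected one-step return: sum of p(s',r|s,a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[ns]) for ns, r, prob in dynamics[(s, a)])

v = {s: 0.0 for s in states}
for _ in range(500):  # value iteration sweeps
    v = {s: max(backup(s, a, v) for a in actions) for s in states}

# Greedy policy extraction: in each state, pick the action achieving the max
policy = {s: max(actions, key=lambda a: backup(s, a, v)) for s in states}
print(v, policy)
</code></pre>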
<h3 id="heading-finding-optimal-policies">Finding Optimal Policies</h3>
<p>Once we have the optimal value functions, finding an optimal policy is straightforward:</p>
<ul>
<li><p>With v*: Choose actions that maximize the expected value of the next state</p>
</li>
<li><p>With q*: Simply choose the action with the highest q*(s, a) in each state</p>
</li>
</ul>
<p>This is why q* is particularly useful - it directly tells us which actions are best in each state without requiring additional computation.</p>
<h3 id="heading-greedy-policies">Greedy Policies</h3>
<p>A policy that always selects the action with the highest estimated value is called a greedy policy. When we have the true optimal value function v*, a greedy policy with respect to v* is guaranteed to be optimal.</p>
<p>In chess, this would mean always making the move that leads to the position with the highest v* value.</p>
<h2 id="heading-7-optimality-and-approximation">7. Optimality and Approximation</h2>
<p>While we've defined optimal policies and value functions mathematically, finding them exactly is often computationally infeasible for real-world problems.</p>
<h3 id="heading-computational-challenges">Computational Challenges</h3>
<p>For many interesting problems, the state space is enormous:</p>
<ul>
<li><p>Chess has approximately 10^43 legal positions</p>
</li>
<li><p>Go has approximately 10^170</p>
</li>
<li><p>Real-world robotics problems have continuous state spaces with infinite states</p>
</li>
</ul>
<p>Even with today's computing power, we cannot solve the Bellman optimality equations exactly for such large problems.</p>
<h3 id="heading-the-role-of-approximation">The Role of Approximation</h3>
<p>In practice, reinforcement learning relies on approximation methods:</p>
<ol>
<li><p>Function approximation: Using parameterized functions (like neural networks) to represent value functions</p>
</li>
<li><p>Sample-based learning: Learning from experience rather than complete knowledge of the environment</p>
</li>
<li><p>Temporal difference learning: Bootstrapping estimates from other estimates</p>
</li>
</ol>
<h3 id="heading-focusing-on-important-states">Focusing on Important States</h3>
<p>A key advantage of reinforcement learning is that it can focus computational resources on states that the agent actually encounters, rather than trying to learn optimal behavior for all possible states.</p>
<p>In chess, professional players don't memorize optimal moves for all 10^43 positions - they focus on positions that commonly arise in games. Similarly, reinforcement learning algorithms can prioritize learning about frequently encountered states.</p>
<h3 id="heading-balancing-exploration-and-exploitation">Balancing Exploration and Exploitation</h3>
<p>A fundamental challenge in reinforcement learning is the exploration-exploitation dilemma:</p>
<ul>
<li><p>Exploitation: Using current knowledge to maximize rewards</p>
</li>
<li><p>Exploration: Trying new actions to discover potentially better strategies</p>
</li>
</ul>
<p>In chess, this would be like deciding whether to play a familiar opening (exploitation) or try a new one (exploration).</p>
]]></content:encoded></item><item><title><![CDATA[Exploration vs. Exploitation: A Deep Dive into Multi-armed Bandits]]></title><description><![CDATA[A k-armed Bandit Problem
Imagine you're at a casino, faced with a row of slot machines (one-armed bandits), each with its own hidden probability of paying out. Your goal is to maximize your winnings over the night, but you don't know which machines h...]]></description><link>https://blogs.yashpatel.xyz/exploration-vs-exploitation-a-deep-dive-into-multi-armed-bandits</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/exploration-vs-exploitation-a-deep-dive-into-multi-armed-bandits</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 19 Feb 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634237092/3b2b0aa8-1c6c-4dd1-b565-81dfbe56e2af.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-a-k-armed-bandit-problem">A k-armed Bandit Problem</h2>
<p>Imagine you're at a casino, faced with a row of slot machines (one-armed bandits), each with its own hidden probability of paying out. Your goal is to maximize your winnings over the night, but you don't know which machines have the best odds. This is essentially the k-armed bandit problem - a fundamental challenge in reinforcement learning that elegantly captures the exploration-exploitation dilemma.</p>
<p>In mathematical terms, we have k different actions (pulling different slot machine arms), and each action a has an expected reward, which we call its value q*(a). When we select an action At at time step t, we receive a reward Rt drawn from a probability distribution dependent on the selected action:</p>
<p>$$q^*(a) \doteq \mathbb{E}[R_t | A_t = a]$$</p><p>The catch? We don't know these values in advance. We can only estimate them based on our experience, and these estimates are denoted as Qt(a).</p>
<p>This creates our central dilemma: should we exploit our current knowledge by selecting what appears to be the best arm based on our limited experience (the "greedy" action), or should we explore other arms to potentially discover better options? If we always exploit, we might get stuck with a suboptimal arm. If we always explore, we'll learn a lot but might not maximize our rewards.</p>
<p>This isn't just a theoretical problem. A doctor choosing treatments for patients faces this exact dilemma - should they stick with the treatment that has worked best so far (exploit) or try a promising new alternative (explore)? The stakes in such real-world scenarios are often much higher than casino winnings.</p>
<h2 id="heading-action-value-methods">Action-value Methods</h2>
<p>To tackle the k-armed bandit problem, we need ways to estimate the value of each action and strategies to select actions based on these estimates.</p>
<p>The most natural approach to estimating action values is to average the rewards we've received when taking that action:</p>
<p>$$Q_t(a) \doteq \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$</p><p>Where the indicator 𝟙 equals 1 when its predicate is true and 0 otherwise. This is called the sample-average method, and by the law of large numbers, as we take action a more and more times, Qt(a) will converge to the true value q*(a).</p>
<p>Once we have these estimates, how do we select actions? The simplest approach is the greedy method:</p>
<p>$$A_t \doteq \underset{a}{\operatorname{argmax}} Q_t(a)$$</p><p>This always selects the action with the highest estimated value. But as we've discussed, pure exploitation can lead to suboptimal results.</p>
<p>A simple but effective alternative is the ε-greedy method: with probability ε, we select a random action (exploration), and with probability 1-ε, we select the greedy action (exploitation). This ensures that all actions will eventually be tried enough times for their estimates to converge to their true values.</p>
<p>For example, if we have two actions and set ε=0.5, there's a 50% chance we explore randomly (25% chance for each action) and a 50% chance we exploit. So the greedy action has a 50% + 25% = 75% chance of being selected, while the non-greedy action has only a 25% chance.</p>
<h2 id="heading-the-10-armed-testbed">The 10-armed Testbed</h2>
<p>To evaluate these methods empirically, Sutton and Barto created a testbed with 2,000 randomly generated k-armed bandit problems where k=10. For each problem, the true action values q*(a) were selected from a normal distribution with mean 0 and variance 1. When an action was selected, the actual reward was drawn from a normal distribution with mean q*(a) and variance 1.</p>
<p>Let me walk you through what this means in practice. Imagine one specific bandit problem in this testbed. The true values of the 10 arms might be something like:<br />[-0.7, 0.5, 1.2, -0.3, 0.8, -1.5, 0.2, 1.5, -0.9, 0.1]</p>
<p>The best arm here is arm 8 with a value of 1.5, but we don't know this in advance. When we pull this arm, we don't get exactly 1.5 as a reward - we get a random value from a normal distribution centered at 1.5. Sometimes we might get 2.3, other times 0.7, occasionally even negative values, though they're less likely.</p>
<p>The results from running different methods on this testbed were revealing:</p>
<ol>
<li><p>The greedy method (ε=0) improved quickly at first but often got stuck on suboptimal actions, achieving an average reward of only about 1 out of a possible 1.55.</p>
</li>
<li><p>The ε-greedy methods performed better in the long run because they continued to explore. With ε=0.1, the method explored more and usually found the optimal action earlier but never selected it more than 91% of the time. With ε=0.01, it improved more slowly but eventually performed better.</p>
</li>
</ol>
<p>These results highlight a crucial insight: the value of exploration depends on the specific problem. If rewards were more variable (higher variance), exploration would be even more beneficial. If rewards were deterministic, a greedy approach might work better. And in nonstationary environments, where the best action changes over time, ongoing exploration becomes essential.</p>
<h2 id="heading-incremental-implementation">Incremental Implementation</h2>
<p>When implementing these methods, we need to efficiently compute the action-value estimates. The naive approach would store all rewards and recompute the average each time, but this would require increasing memory and computation as we gather more experience.</p>
<p>Fortunately, we can use an incremental update formula. If Qn is our estimate after n-1 rewards, and we receive a new reward Rn, we can update our estimate as:</p>
<p>$$Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n]$$</p><p>This elegant formula requires storing only two values per action (the current estimate Qn and the count n) and performs a constant amount of computation per step.</p>
<p>This formula follows a general pattern that appears throughout reinforcement learning:</p>
<p>$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$</p><p>The term [Target - OldEstimate] represents an error in our estimate, and we move our estimate toward the target by a fraction determined by the step size.</p>
<p>Here's a complete algorithm for the ε-greedy bandit method with incremental updates:</p>
<pre><code class="lang-plaintext">Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Loop forever:
    A ← {
        argmax_a Q(a)  with probability 1-ε
        a random action  with probability ε
    }
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + 1/N(A) * [R - Q(A)]
</code></pre>
<p>This algorithm efficiently implements the ε-greedy approach with sample-average estimation, requiring minimal memory and computation.</p>
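<p>The pseudocode translates almost line for line into Python. To keep the sketch self-contained, the bandit itself is simulated with invented true action values, and the explore/exploit coin flip is drawn with probabilities ε and 1 − ε.</p>
<pre><code class="lang-python">import random

random.seed(0)
q_true = [0.2, -0.5, 1.0, 0.1]  # invented true action values q*(a)
k = len(q_true)

def bandit(a):
    """Simulated arm: reward is q*(a) plus unit-variance Gaussian noise."""
    return random.gauss(q_true[a], 1.0)

eps = 0.1
Q = [0.0] * k  # value estimates
N = [0] * k    # pull counts
for _ in range(5000):
    explore = random.choices((True, False), weights=(eps, 1 - eps))[0]
    if explore:
        A = random.randrange(k)                 # random action
    else:
        A = max(range(k), key=lambda a: Q[a])   # greedy action
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                   # incremental sample average

print(max(range(k), key=lambda a: Q[a]))        # arm the agent believes is best
</code></pre>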
<h2 id="heading-tracking-a-non-stationary-problem">Tracking a Non-stationary Problem</h2>
<p>In many real-world scenarios, the true values of actions change over time - what was the best action yesterday might not be the best today. Think of our casino example: what if the slot machines were being adjusted throughout the night, changing their payout probabilities?</p>
<p>The sample-average method we've discussed gives equal weight to all rewards, which isn't ideal for tracking changing values. Instead, we can use a constant step-size parameter α:</p>
<p>$$Q_{n+1} \doteq Q_n + \alpha [R_n - Q_n]$$</p><p>This results in Qt being a weighted average of past rewards, with more recent rewards given more weight:</p>
<p>$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$</p><p>The weight given to reward Ri is α(1-α)^(n-i), which decreases exponentially as the reward gets older. This is called an exponential recency-weighted average.</p>
<p>For example, with α=0.1, the weight of the most recent reward is 0.1, the previous reward gets 0.09, the one before that 0.081, and so on. This allows our estimates to adapt to changing values.</p>
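<p>These weights are easy to verify numerically, including the fact that, together with the residual weight (1-α)^n left on the initial estimate Q1, they sum to 1.</p>
<pre><code class="lang-python">alpha = 0.1
n = 10  # number of rewards seen so far

# Weight on reward R_i is alpha * (1 - alpha)^(n - i), for i = 1..n
weights = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
initial_weight = (1 - alpha) ** n  # weight remaining on the initial estimate Q_1

print(weights[-3:])                   # most recent three: ≈ 0.081, 0.09, 0.1
print(initial_weight + sum(weights))  # ≈ 1.0
</code></pre>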
<p>While constant step sizes are great for nonstationary problems, they don't guarantee convergence to the true action values in stationary problems. For convergence, step-size parameters αn(a) must satisfy:</p>
<p>$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty$$</p><p>$$\sum_{n=1}^{\infty} \alpha_n^2(a) &lt; \infty$$</p><p>The sample-average method with αn(a)=1/n satisfies these conditions, but constant step sizes don't. However, in practice, the ability to track nonstationary problems often outweighs the theoretical guarantee of convergence in stationary ones.</p>
<h2 id="heading-optimistic-initial-values">Optimistic Initial Values</h2>
<p>Another approach to encouraging exploration is through optimistic initial values. Instead of initializing our action-value estimates to zero or some neutral value, we set them to be optimistically high - higher than we expect any action's true value to be.</p>
<p>When we select actions based on these optimistic estimates, we'll inevitably be "disappointed" by the actual rewards, which will be lower than our initial estimates. This disappointment drives exploration: as we update our estimates downward for the actions we've tried, untried actions (which still have their optimistic initial values) become relatively more attractive.</p>
<p>For example, in the 10-armed testbed where true action values are normally distributed with mean 0, setting initial estimates to +5 is wildly optimistic. This causes the algorithm to try all actions several times before settling into more exploitative behavior.</p>
<p>The beauty of this approach is that it drives exploration without requiring random action selection. A purely greedy algorithm with optimistic initialization will explore extensively in its early stages.</p>
<p>However, optimistic initialization has a significant limitation: its exploration is inherently temporary. Once all actions have been tried enough to bring their estimates down from the initial optimistic values, the method becomes purely exploitative. If the environment changes later, creating a renewed need for exploration, optimistic initialization can't help.</p>
<h2 id="heading-upper-confidence-bound-action-selection">Upper-Confidence-Bound Action Selection</h2>
<p>The Upper-Confidence-Bound (UCB) algorithm takes a more sophisticated approach to the exploration-exploitation trade-off. It selects actions according to:</p>
<p>$$A_t \doteq \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$</p><p>Where Nt(a) is the number of times action a has been selected prior to time t, and c &gt; 0 controls the degree of exploration.</p>
<p>This formula balances two factors:</p>
<ol>
<li><p>The estimated value Qt(a) (exploitation)</p>
</li>
<li><p>A measure of uncertainty √(ln t / Nt(a)) (exploration)</p>
</li>
</ol>
<p>The uncertainty term increases when an action hasn't been selected for a while (as t increases but Nt(a) doesn't) and decreases when an action is selected (as Nt(a) increases). This naturally drives exploration toward actions with promising values or high uncertainty.</p>
<p>In our casino analogy, UCB is like a strategic gambler who not only considers which machines have paid well so far but also which ones haven't been tried enough to be confident about their payouts.</p>
<p>UCB often performs well in practice, outperforming ε-greedy methods on many problems. However, it's more complex to extend beyond simple bandit problems to the full reinforcement learning setting, particularly for nonstationary problems or large state spaces.</p>
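<p>As an illustrative sketch (again my own, with an arbitrary c and seed), the UCB rule can be run on the same kind of testbed. Untried actions are treated as maximizing, so each arm is sampled once before the bound kicks in:</p>

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).
    An action with N_t(a) == 0 is treated as maximally uncertain."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

rng = np.random.default_rng(1)
k = 10
true_values = rng.normal(0.0, 1.0, k)
Q, N = np.zeros(k), np.zeros(k)

for t in range(1, 1001):
    a = ucb_action(Q, N, t)
    r = rng.normal(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update
```

<p>Notice how the bonus term shrinks for frequently pulled arms and grows (via ln t) for neglected ones, which is exactly the uncertainty-driven exploration described above.</p>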
<h2 id="heading-gradient-bandit-algorithms">Gradient Bandit Algorithms</h2>
<p>Gradient bandit algorithms take a different approach. Instead of estimating action values directly, they learn a preference Ht(a) for each action. Higher preferences make an action more likely to be selected, but preferences don't have any interpretation in terms of reward.</p>
<p>Actions are selected according to a soft-max (Boltzmann) distribution:</p>
<p>$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$</p><p>Initially, all preferences are equal (e.g., Ht(a)=0 for all a), so all actions have an equal probability of being selected.</p>
<p>After selecting action At and receiving reward Rt, the preferences are updated as:</p>
<p>$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha (R_t - \bar{R}_t) (1 - \pi_t(A_t))$$</p><p>$$H_{t+1}(a) \doteq H_t(a) - \alpha (R_t - \bar{R}_t) \pi_t(a), \quad \text{for all } a \neq A_t$$</p><p>where α &gt; 0 is a step-size parameter, and R̄t is the average of all rewards up to time t, serving as a baseline.</p>
<p>This update rule increases the preference for the selected action if the reward was higher than the baseline and decreases it if the reward was lower. Non-selected actions move in the opposite direction.</p>
<p>What's fascinating about this algorithm is that it can be derived as a stochastic gradient ascent method, maximizing the expected reward. The derivation involves calculus but shows that the algorithm is moving the action preferences in the direction that increases expected reward.</p>
<p>In our casino example, the gradient bandit algorithm would be like a gambler who develops intuitive preferences for different machines rather than trying to estimate their exact payout rates. After each play, they adjust their preferences based on whether the reward was better or worse than what they've been getting on average.</p>
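<p>The two update equations vectorize neatly. The sketch below (my own; α, the seed, and the incremental baseline are illustrative choices) subtracts α(Rt − R̄t)πt(a) from every preference and then adds α(Rt − R̄t) back to the selected action, which reproduces the (1 − πt(At)) factor for At:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
k = 10
true_values = rng.normal(0.0, 1.0, k)

H = np.zeros(k)   # action preferences, all equal initially
baseline = 0.0    # running average of rewards (the R̄_t baseline)
alpha = 0.1

for t in range(1, 1001):
    pi = np.exp(H - H.max())  # soft-max, shifted for numerical stability
    pi /= pi.sum()
    a = rng.choice(k, p=pi)
    r = rng.normal(true_values[a], 1.0)
    baseline += (r - baseline) / t  # incremental average of all rewards

    # H[a] gets +alpha*(r-baseline)*(1-pi[a]); all others get -alpha*(r-baseline)*pi
    H -= alpha * (r - baseline) * pi
    H[a] += alpha * (r - baseline)
```

<p>Since the same constant is added to or subtracted from preferences only through the soft-max, only the differences between preferences matter, consistent with preferences having no direct reward interpretation.</p>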
<h2 id="heading-associative-search-contextual-bandits">Associative Search (Contextual Bandits)</h2>
<p>So far, we've considered nonassociative tasks where there's a single best action (or a best action that changes over time). But in many real-world problems, the best action depends on the situation or context.</p>
<p>Imagine if our casino had multiple slot machines, but their payout probabilities changed based on a visible signal - perhaps a color displayed on the machine. This is an associative search task, also called a contextual bandit problem. We need to learn not just which action is best overall, but which action is best in each context.</p>
<p>For example, if the machine shows red, arm 1 might be best; if it shows green, arm 2 might be best. By learning these associations, we can perform much better than if we treated all situations as the same.</p>
<p>Associative search tasks bridge the gap between the simple k-armed bandit problem and the full reinforcement learning problem. They involve learning a policy (a mapping from situations to actions), but like bandits, each action only affects the immediate reward, not future situations.</p>
<p>Consider a concrete example: suppose you face a 2-armed bandit where the true values randomly switch between two cases:</p>
<ul>
<li><p>Case A: action 1 has value 10, action 2 has value 20</p>
</li>
<li><p>Case B: action 1 has value 90, action 2 has value 80</p>
</li>
</ul>
<p>If you can't tell which case you're in, the best you can do is always select action 1, giving an expected reward of (10+90)/2 = 50 per step. But if you're told which case you're facing, you can select action 2 in case A and action 1 in case B, achieving an expected reward of (20+90)/2 = 55 per step.</p>
<p>This example illustrates how contextual information can significantly improve performance, allowing us to adapt our actions to the specific situation we're in.</p>
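<p>The arithmetic in the two-case example above can be checked in a few lines (a toy calculation, assuming both cases are equally likely):</p>

```python
# Case A: action values (10, 20); Case B: action values (90, 80); cases equally likely.
values = {"A": (10, 20), "B": (90, 80)}

# Blind policy: stick to whichever single action is best on average across cases.
blind = max(
    sum(values[c][a] for c in values) / len(values)  # average value of action a
    for a in (0, 1)
)

# Contextual policy: pick the best action within each case.
contextual = sum(max(values[c]) for c in values) / len(values)

print(blind, contextual)  # 50.0 55.0
```

<p>Either fixed action averages out to 50, while conditioning on the visible case lifts the expected reward to 55.</p>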
<h2 id="heading-conclusion">Conclusion</h2>
<p>The multi-armed bandit problem provides a rich framework for understanding the exploration-exploitation dilemma that lies at the heart of reinforcement learning. We've explored several approaches to balancing these competing needs:</p>
<ul>
<li><p>ε-greedy methods, which explicitly separate exploration and exploitation</p>
</li>
<li><p>Optimistic initialization, which drives exploration through initially high value estimates</p>
</li>
<li><p>Upper-Confidence-Bound algorithms, which balance exploitation with uncertainty-based exploration</p>
</li>
<li><p>Gradient bandit algorithms, which learn action preferences rather than value estimates</p>
</li>
<li><p>Contextual bandits, which extend these ideas to situation-dependent action selection</p>
</li>
</ul>
<p>Each method has its strengths and weaknesses, and their relative performance depends on the specific problem characteristics. UCB methods often perform best on stationary problems, while constant-α methods adapt better to nonstationary environments.</p>
<p>The exploration-exploitation dilemma extends far beyond bandit problems. It appears in various forms throughout reinforcement learning and is a fundamental challenge in any learning system that must make decisions based on limited information.</p>
<p>As we move from bandits to the full reinforcement learning problem, we'll see how these ideas extend to sequential decision-making, where actions affect not just immediate rewards but also future situations and opportunities. The methods we've explored here provide a foundation for understanding these more complex scenarios.</p>
<p>In our casino analogy, we're no longer just playing individual slot machines - we're navigating a complex casino where each decision affects not only our immediate winnings but also which games we'll have access to next. This is the full reinforcement learning problem, and it's the focus of the rest of our journey.</p>
]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning: First Principles]]></title><description><![CDATA[NoteThis blog is primarily based on my personal notes from Chapter 1 of Reinforcement Learning: An Introduction by Sutton and Barto, structured to improve

Hey there! Welcome to my blog series on reinforcement learning (RL). I'm currently reading Rei...]]></description><link>https://blogs.yashpatel.xyz/reinforcement-learning-first-principles</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/reinforcement-learning-first-principles</guid><category><![CDATA[Reinforcement Learning]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738942220129/e4e3c859-4712-4987-b818-45e41d2af027.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>Note</summary><div data-type="detailsContent">This blog is primarily based on my personal notes from Chapter 1 of Reinforcement Learning: An Introduction by Sutton and Barto, structured to improve</div></details>

<p>Hey there! Welcome to my blog series on reinforcement learning (RL). I'm currently reading <a target="_blank" href="http://incompleteideas.net/book/the-book-2nd.html"><em>Reinforcement Learning: An Introduction by Sutton and Barto</em></a>, and as I go through the book, I am making my own notes. I wanted to take this opportunity to document my learning process in the form of blog posts. My goal with this series is to explain RL concepts from my perspective and break them down in an easy-to-understand way. If you have a background equivalent to a BSc in Computer Science, you should be able to follow along comfortably.</p>
<p>That said, I want to be upfront about the fact that I am still learning. I'm no expert in this field, so if I make any mistakes along the way, I sincerely apologize in advance. If you spot any errors or have any suggestions, I would love to hear from you! Feel free to reach out to me via email—I would be more than happy to learn and improve.</p>
<p>As I am self-learning RL, I have also used AI tools to help explain topics that I initially found difficult to grasp. I believe in leveraging available resources to make learning more efficient. I will aim to publish one blog post per chapter of the book. However, due to my full-time job and my ongoing final-year university project, I can't promise a fixed schedule for when each post will be published. But rest assured, I am committed to completing this series, and I hope it will be a useful resource for anyone else looking to learn RL!</p>
<h2 id="heading-reinforcement-learning">Reinforcement Learning</h2>
<p>Reinforcement learning (RL) is a paradigm of learning where an agent interacts with its environment to discover optimal actions through a process of trial and error. Unlike supervised learning, where explicit instructions or labeled data guide the learning process, RL requires the agent to explore different actions, assess their consequences, and refine its strategy based on received rewards. One of the most defining aspects of RL is that rewards can be immediate or delayed, making long-term planning an essential component of effective learning.</p>
<h3 id="heading-key-characteristics-of-reinforcement-learning">Key Characteristics of Reinforcement Learning</h3>
<ul>
<li><p><strong>Trial and Error Learning</strong>: The agent learns by continuously experimenting with different actions and observing their consequences.</p>
</li>
<li><p><strong>Delayed Rewards</strong>: Unlike immediate feedback systems, RL often involves scenarios where an agent's actions influence future rewards, requiring it to balance short-term gains with long-term benefits.</p>
</li>
</ul>
<h3 id="heading-how-reinforcement-learning-is-viewed">How Reinforcement Learning is Viewed</h3>
<p>Reinforcement learning can be categorized in three primary ways:</p>
<ul>
<li><p><strong>As a problem to be solved</strong>: RL aims to determine the best decision-making strategy to maximize long-term rewards.</p>
</li>
<li><p><strong>As a collection of solution methods</strong>: Various computational approaches and algorithms exist to tackle RL problems, including value-based, policy-based, and model-free learning techniques.</p>
</li>
<li><p><strong>As a research field</strong>: RL is an active domain of research that seeks to refine algorithms, explore theoretical underpinnings, and expand its applications across multiple industries.</p>
</li>
</ul>
<h3 id="heading-formalizing-the-rl-problem">Formalizing the RL Problem</h3>
<p>The RL framework is formalized using concepts from dynamical systems theory, particularly as the optimal control of partially known <strong>Markov Decision Processes (MDPs)</strong>. MDPs serve as the mathematical foundation of RL, structuring the learning problem into key components:</p>
<ul>
<li><p><strong>Sensation (State Perception)</strong>: The agent must perceive its environment and identify the current state.</p>
</li>
<li><p><strong>Action (Decision Making)</strong>: The agent must decide on an action that influences the state of the environment.</p>
</li>
<li><p><strong>Goal (Optimization Objective)</strong>: The agent’s objective is to maximize cumulative rewards over time by making intelligent action choices.</p>
</li>
</ul>
<h3 id="heading-the-role-of-learning-agents">The Role of Learning Agents</h3>
<p>To function effectively in an RL framework, an agent must possess the ability to:</p>
<ul>
<li><p>Sense and interpret the state of the environment.</p>
</li>
<li><p>Take actions that actively influence the environment’s state.</p>
</li>
<li><p>Align its decision-making strategy with a predefined goal that ensures long-term success.</p>
</li>
</ul>
<p>Any approach that successfully enables an agent to navigate these challenges qualifies as a reinforcement learning method.</p>
<h3 id="heading-differences-between-supervised-unsupervised-and-reinforcement-learning">Differences Between Supervised, Unsupervised, and Reinforcement Learning</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Supervised Learning</td><td>Unsupervised Learning</td><td>Reinforcement Learning</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Data Dependency</strong></td><td>Requires labeled data for training.</td><td>Uses unlabeled data to identify patterns.</td><td>No predefined dataset; learns through interactions.</td></tr>
<tr>
<td><strong>Objective</strong></td><td>Learns to map input to output using labeled examples.</td><td>Identifies hidden structures and relationships in data.</td><td>Learns to maximize cumulative rewards through actions.</td></tr>
<tr>
<td><strong>Feedback Type</strong></td><td>Direct supervision through labeled examples.</td><td>No direct feedback, only pattern recognition.</td><td>Reward-based feedback guiding decision-making.</td></tr>
<tr>
<td><strong>Learning Process</strong></td><td>Trains using historical data and supervised loss functions.</td><td>Clusters or reduces dimensionality of data for better insights.</td><td>Explores and exploits different actions to optimize performance.</td></tr>
<tr>
<td><strong>Application</strong></td><td>Image classification, speech recognition, spam detection.</td><td>Customer segmentation, anomaly detection, recommendation systems.</td><td>Game playing, robotics, autonomous systems, financial trading.</td></tr>
<tr>
<td><strong>Decision Dependency</strong></td><td>Decisions are independent of past predictions.</td><td>Decisions are based on statistical structures.</td><td>Decisions depend on past actions and state transitions.</td></tr>
</tbody>
</table>
</div><h3 id="heading-additional-considerations">Additional Considerations</h3>
<p>While the above table highlights key differences, RL introduces additional complexities not present in supervised or unsupervised learning. One crucial aspect is <strong>exploration vs. exploitation</strong>, where an agent must balance between trying new actions (exploration) and leveraging known information (exploitation) to maximize rewards. RL is also distinct in its emphasis on <strong>long-term dependencies</strong>, as agents often need to strategize for future rewards rather than merely optimizing for immediate outcomes. Additionally, RL environments are <strong>dynamic</strong>, meaning that the agent's policy and the environment itself may evolve over time based on interactions.</p>
<p>Another unique characteristic of RL is its reliance on <strong>trial-and-error learning</strong>, where agents make decisions without prior knowledge and gradually refine their approach through experience. This is particularly useful in scenarios where optimal decision-making cannot be explicitly programmed but must be discovered through continuous learning.</p>
<p>Reinforcement learning (RL) takes a comprehensive approach by considering the entire goal-directed agent interacting with an uncertain environment. In contrast, many other learning approaches focus on isolated subproblems without addressing the bigger picture. While these alternative approaches have yielded valuable insights, their emphasis on fragmented issues presents a significant limitation when dealing with dynamic and evolving environments.</p>
<p>It is typically assumed that an RL agent must operate in the face of significant uncertainty. When RL involves planning, it must address the balance between strategic foresight and real-time decision-making, while also tackling how environmental models are developed and refined. Similarly, when RL incorporates supervised learning, it does so with a clear purpose—defining which capabilities are essential for the learning process and which are not. For learning research to advance meaningfully, subproblems must be examined in ways that align with the broader goal of creating complete, interactive, and goal-seeking agents.</p>
<p>As Sutton and Barto state in their book:</p>
<blockquote>
<p>It is not clear how far back the pendulum will swing, but reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence.</p>
</blockquote>
<h2 id="heading-elements-of-reinforcement-learning">Elements of Reinforcement Learning</h2>
<p>Reinforcement learning consists of four fundamental elements that define how an agent interacts with its environment and learns over time.</p>
<h3 id="heading-1-policy">1. Policy</h3>
<p>A <strong>policy</strong> is the strategy that an agent follows while making decisions. It maps the agent's perceived state of the environment to the actions it should take. Policies can be deterministic (choosing a specific action for a state) or stochastic (choosing actions based on probability distributions). The policy serves as the brain of the agent, determining its behavior at any given time.</p>
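<p>A tiny sketch can make the deterministic/stochastic distinction concrete. The states and actions here are made up for illustration (loosely inspired by the book's recycling-robot example):</p>

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def act(policy, state):
    choice = policy[state]
    if isinstance(choice, str):  # deterministic: return the mapped action
        return choice
    actions, probs = zip(*choice.items())  # stochastic: sample from the distribution
    return random.choices(actions, weights=probs)[0]
```

<p>Both policies map perceived states to actions; the stochastic one simply answers with a sample from a distribution rather than a fixed choice.</p>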
<h3 id="heading-2-reward-signal">2. Reward Signal</h3>
<p>The <strong>reward signal</strong> is the primary feedback mechanism in RL. Each action taken by the agent results in a reward, which is a numerical value that indicates how beneficial that action was. The objective of an RL agent is to maximize the cumulative reward over time. Rewards drive the learning process, reinforcing actions that lead to better outcomes and discouraging less optimal actions.</p>
<h3 id="heading-3-value-function">3. Value Function</h3>
<p>The <strong>value function</strong> estimates the long-term desirability of a given state. Unlike the reward signal, which provides immediate feedback, the value function helps the agent assess how beneficial a state is in the long run. It enables the agent to make more informed decisions by considering future rewards, rather than focusing solely on immediate gains.</p>
<h3 id="heading-4-model-of-the-environment-optional">4. Model of the Environment (Optional)</h3>
<p>Some RL methods use a <strong>model of the environment</strong> to predict state transitions and rewards before taking actions. These are called model-based methods. Conversely, model-free methods do not rely on such predictions and instead learn purely from trial and error. A model can enhance planning capabilities by simulating future scenarios and refining decision-making processes.</p>
<h3 id="heading-analogy-a-chefs-culinary-journey"><em>Analogy: A Chef’s Culinary Journey</em></h3>
<p>Imagine a chef who is learning to create the perfect dish. The <strong>policy</strong> represents the chef's approach to cooking—whether they follow a recipe strictly or experiment with different ingredients. The <strong>reward signal</strong> comes from the feedback they receive, either from tasting their dish or from customer reviews. Over time, the chef learns which combinations of flavors and techniques result in the best dishes—this is akin to developing a <strong>value function</strong>, where they refine their intuition for what works best. If the chef keeps detailed notes on ingredient substitutions and cooking methods, they are essentially building a <strong>model of the environment</strong>, helping them anticipate the outcome of their future creations.</p>
<h2 id="heading-summary-of-key-points">Summary of Key Points</h2>
<ul>
<li><p><strong>Reinforcement Learning (RL)</strong> is a distinct learning paradigm that relies on interaction with the environment rather than labeled data.</p>
</li>
<li><p><strong>Key difference from Supervised and Unsupervised Learning:</strong> RL focuses on decision-making through trial-and-error and reward signals.</p>
</li>
<li><p><strong>Core Elements of RL:</strong></p>
<ul>
<li><p><strong>Policy:</strong> The strategy an agent follows to decide actions.</p>
</li>
<li><p><strong>Reward Signal:</strong> Feedback mechanism guiding the learning process.</p>
</li>
<li><p><strong>Value Function:</strong> Estimates long-term benefits of being in a particular state.</p>
</li>
<li><p><strong>Model of the Environment (optional):</strong> Used for planning in model-based methods.</p>
</li>
</ul>
</li>
<li><p><strong>Exploration vs. Exploitation:</strong> Balancing between trying new actions and leveraging past knowledge.</p>
</li>
<li><p><strong>Delayed Rewards:</strong> Actions may not provide immediate benefits, making long-term planning crucial.</p>
</li>
<li><p><strong>Markov Decision Processes (MDPs):</strong> The mathematical framework behind RL.</p>
</li>
<li><p><strong>Trial-and-Error Learning:</strong> RL agents learn through continuous interaction rather than pre-labeled data.</p>
</li>
<li><p><strong>Dynamic Environments:</strong> RL methods adapt to changing conditions and optimize policies over time.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>