<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Yash's Blog]]></title><description><![CDATA[Yash's Blog]]></description><link>https://blogs.yashpatel.xyz</link><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 10:49:52 GMT</lastBuildDate><atom:link href="https://blogs.yashpatel.xyz/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mixtral of Experts: Top-2 Routing Gives 47B Capacity at 13B Active Compute]]></title><description><![CDATA[At roughly 2.08e11 cumulative FLOPs in my own run, a dense baseline lands at 25.31 validation perplexity. At nearly the same compute budget, a sparse MoE lands at 20.98. The absolute delta is 4.324, a]]></description><link>https://blogs.yashpatel.xyz/mixtral-of-experts-top-2-routing-gives-47b-capacity-at-13b-active-compute</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/mixtral-of-experts-top-2-routing-gives-47b-capacity-at-13b-active-compute</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/085563ad-80ec-49be-a499-0c32e95ed803.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At roughly 2.08e11 cumulative FLOPs in my own run, a dense baseline lands at 25.31 validation perplexity. At nearly the same compute budget, a sparse MoE lands at 20.98. The absolute delta is 4.324, a 17.09% reduction in validation perplexity at almost the same training compute.</p>
<p>Before sparse MoE layers became practical, the scaling path was blunt. If I wanted more capability, I paid for a bigger dense feedforward stack on every token, every layer, every step. Training cost rose. Inference cost rose. Memory pressure rose. Production teams got trapped in a false binary: either ship a smaller model that misses quality targets, or ship a larger one with painful latency and infrastructure bills.</p>
<p>Mixtral addresses that exact bottleneck. Instead of running one dense FFN per token, each FFN block is replaced by a router plus multiple experts. The router picks only two experts for each token state. The model keeps large total parameter capacity while keeping per-token active compute bounded.</p>
<p>That is the key pre-paper pain and paper-level claim in one line: dense scaling ties quality to always-on compute, while sparse routing partially decouples them.</p>
<h2>The Paper's Claim</h2>
<p>The paper frames dense transformer FFNs as the scaling bottleneck. Dense layers activate all FFN parameters for every token, so quality gains come with directly higher per-token compute. Mixtral's proposal is to replace dense FFNs with sparse MoE FFNs: 8 experts per layer, top-2 active experts per token, weighted recombination through router probabilities.</p>
<p>The central claim is not only architectural novelty. It is quality-per-compute at useful scale. The authors report Mixtral 8x7B outperforming Llama 2 70B on benchmarks such as MMLU, HellaSwag, and GSM8K, while approaching GPT-3.5-level results on many public evaluations. They pair that quality claim with an efficiency claim: 12.9B active parameters per token, not 46.7B, and materially faster inference than comparable dense 70B-class models.</p>
<h2>Mechanism</h2>
<p>Mixtral is still a decoder-only Transformer. The QKV path is unchanged from my attention walkthrough in <a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">attention-is-all-you-need</a>. The architectural shift is local but deep: the feedforward block in each Transformer layer is replaced by a sparse mixture of experts block.</p>
<p>At system level, one token at layer 𝑙 does this:</p>
<ol>
<li><p>Runs attention as usual.</p>
</li>
<li><p>Enters a router.</p>
</li>
<li><p>Router picks top-2 experts out of 8.</p>
</li>
<li><p>Token is processed by only those experts.</p>
</li>
<li><p>Expert outputs are weighted and added.</p>
</li>
</ol>
<h3>Component Breakdown</h3>
<p>For each MoE layer:</p>
<ul>
<li><p>Router: one linear map from hidden state to 8 logits.</p>
</li>
<li><p>Top-k selection: keep two highest-logit experts.</p>
</li>
<li><p>Expert FFNs: in Mixtral-style implementations, each expert is a SwiGLU MLP.</p>
</li>
<li><p>Aggregation: weighted sum of selected expert outputs.</p>
</li>
</ul>
<p>In my artifact code, this lives in <code>moe_core.py</code> with <code>MoEFeedForward</code> and <code>SwiGLUExpert</code>. The dense baseline uses a SwiGLU FFN too, so dense and MoE FFN compute are compared within the same FFN family.</p>
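<p>A minimal PyTorch sketch of such a block, looping over experts for readability (my artifact's real <code>MoEFeedForward</code> uses batched dispatch; the widths here match the smoke run described below but are otherwise illustrative):</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU MLP (three linear maps, no bias)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoEFeedForward(nn.Module):
    """Router + 8 experts, top-2 active per token (loop form for clarity)."""
    def __init__(self, d_model=192, d_ff_expert=96, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(d_model, d_ff_expert) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)    # renormalise over the top-2 only
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():                 # run each expert only on its tokens
                    y[mask] += gates[mask, slot:slot+1] * expert(x[mask])
        return y
</code></pre>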
<h3>Worked Token Example</h3>
<p>Take a token state vector \(x_t \in \mathbb{R}^{d_{\text{model}}}\) for the word "latency". Assume the router emits logits over 8 experts:</p>
<p>$$z = [2.9, 0.3, 1.7, -0.1, 2.2, 0.4, -1.2, 0.1]$$</p>
<p>Top-2 indices are experts 0 and 4 (logits 2.9 and 2.2). I softmax only over those two values:</p>
<p>$$p = \text{softmax}([2.9, 2.2]) = [0.668, 0.332]$$</p>
<p>Then only two expert MLPs run:</p>
<p>$$y_t = 0.668 \cdot E_0(x_t) + 0.332 \cdot E_4(x_t)$$</p>
<p>Experts 1,2,3,5,6,7 are skipped for this token at this layer.</p>
<p>Next token can pick a different pair. Next layer can pick another pair again. That dynamic token-wise specialization is where the extra capacity comes from.</p>
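<p>The whole worked example fits in a few lines, assuming (as above) that gate weights come from a softmax over only the two kept logits:</p>
<pre><code class="language-python">import torch

z = torch.tensor([2.9, 0.3, 1.7, -0.1, 2.2, 0.4, -1.2, 0.1])
vals, idx = z.topk(2)                 # top-2 logits and their expert IDs
p = torch.softmax(vals, dim=0)        # renormalise over the kept pair only
print(idx.tolist())                   # [0, 4]
print([round(v, 3) for v in p.tolist()])  # [0.668, 0.332]
</code></pre>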
<h3>Step-by-Step Computation Path</h3>
<p>I break one forward pass into the exact sequence that matters for performance and stability:</p>
<ol>
<li><p>Router projection computes 8 logits from one token hidden state.</p>
</li>
<li><p>Top-2 selection keeps expert IDs and gating weights.</p>
</li>
<li><p>Dispatch packs token activations by expert ID.</p>
</li>
<li><p>Each selected expert runs its own SwiGLU MLP on its assigned token slice.</p>
</li>
<li><p>Gather unpacks outputs to original token order.</p>
</li>
<li><p>Weighted combine applies gate weights and sums the two expert outputs.</p>
</li>
<li><p>Residual path adds the MoE output back to the layer stream.</p>
</li>
</ol>
<p>Each step is simple in isolation. The complexity appears in step 3 and step 5 at scale, where token-expert imbalance directly affects kernel efficiency and tail latency.</p>
<p>During training, backpropagation flows through the selected experts and the router probabilities that produced those selections. During inference, the same dispatch path runs without gradient tracking, but the same load patterns still decide latency behavior.</p>
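<p>A sketch of steps 3 through 6, under one simplifying assumption: <code>experts</code> is a plain list of expert modules, and packing is done by sorting the flattened (token, expert) assignments by expert ID so each expert runs exactly one batched call:</p>
<pre><code class="language-python">import torch

def dispatch_gather(x, top_idx, gates, experts):
    """Steps 3-6: pack tokens by expert ID, run each expert once on a
    contiguous slice, then unpack and gate-combine. top_idx and gates
    are (tokens, k) outputs of the router."""
    tokens, k = top_idx.shape
    flat_expert = top_idx.reshape(-1)            # (tokens*k,)
    flat_token = torch.arange(tokens).repeat_interleave(k)
    order = torch.argsort(flat_expert)           # dispatch: group slots by expert
    counts = torch.bincount(flat_expert, minlength=len(experts))
    y = torch.zeros_like(x)
    start = 0
    for e, expert in enumerate(experts):
        n = counts[e].item()
        if n == 0:
            continue                             # imbalance shows up right here
        sel = order[start:start + n]             # flat slots routed to expert e
        start += n
        tok = flat_token[sel]
        out = expert(x[tok])                     # one batched call per expert
        w = gates.reshape(-1)[sel].unsqueeze(-1)
        y.index_add_(0, tok, w * out)            # gather: scatter back, weighted sum
    return y
</code></pre>
<p>With identity experts and gate weights summing to one per token, the output must equal the input; that makes the pack/unpack bookkeeping easy to unit-test.</p>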
<h3>Core Equations</h3>
<p>The routing objective has two parts: sparse dispatch and balanced utilization.</p>
<p>First, sparse dispatch:</p>
<p>$$y_t = \sum_{i \in \text{TopK}(x_t)} p_i(x_t) E_i(x_t), \quad K=2$$</p>
<p>Why this form makes sense: I want conditional compute, but I still need differentiable weighted composition among active experts.</p>
<p>Second, load balancing (Switch-style top-1 auxiliary):</p>
<p>$$L_{aux} = N \sum_{i=1}^{N} f_i p_i$$</p>
<p>where \(N\) is number of experts, \(f_i\) is fraction of tokens routed to expert \(i\) by top-1 assignment, and \(p_i\) is mean router probability mass for expert \(i\).</p>
<p>Why this form makes sense: if routing collapses to a few experts, \(f_i\) and \(p_i\) become skewed; minimizing this term pushes traffic and confidence toward a more uniform spread.</p>
<p>This exact equation is a Switch-style implementation choice in my artifact, not a Mixtral-specific closed-form requirement from the paper.</p>
<p>In my code, this is <code>auxiliary_load_balancing_loss(...)</code>, and tests verify it against the top-1 formula directly.</p>
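<p>A sketch of that function under the top-1 formula above (the name matches my artifact; the exact signature there may differ):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def auxiliary_load_balancing_loss(router_logits):
    """Switch-style top-1 auxiliary loss: L_aux = N * sum_i f_i * p_i.
    router_logits: (tokens, N). Equals 1.0 under perfectly uniform
    router probabilities; grows toward N as routing collapses."""
    probs = F.softmax(router_logits, dim=-1)         # (tokens, N)
    n_experts = probs.shape[-1]
    top1 = probs.argmax(dim=-1)                      # top-1 assignment
    f = torch.bincount(top1, minlength=n_experts).float() / probs.shape[0]
    p = probs.mean(dim=0)                            # mean router prob mass
    return n_experts * torch.sum(f * p)
</code></pre>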
<h3>FLOP Accounting Intuition</h3>
<p>The most useful engineering question is not "how many total parameters?" It is "how much math did I execute per token?"</p>
<p>Dense FFN forward FLOPs (SwiGLU baseline):</p>
<p>$$\text{FLOPs}_{\text{dense ffn}} = 6\, B\, T\, d_{\text{model}}\, d_{\text{ff}}$$</p>
<p>MoE active FFN forward FLOPs with SwiGLU experts (3 linear projections) and top-2 routing:</p>
<p>$$\text{FLOPs}_{\text{moe active ffn}} = 6\, B\, T\, d_{\text{model}}\, d_{\text{ff,expert}}\, K$$</p>
<p>If \(d_{\text{ff,expert}} = d_{\text{ff}}/8\) and \(K=2\):</p>
<p>$$\frac{\text{FLOPs}_{\text{moe active ffn}}}{\text{FLOPs}_{\text{dense ffn}}} = \frac{6 \cdot (d_{\text{ff}}/8) \cdot 2}{6 \cdot d_{\text{ff}}} = \frac{1}{4}$$</p>
<p>So active MoE FFN math is about 25% of dense FFN math under this setup, before adding router overhead. That is exactly why FLOP-matched races are necessary. Parameter counts alone can mislead teams.</p>
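<p>Plugging in the smoke-run widths confirms the 1/4 ratio and shows the router term is negligible (the <code>2 · d_model · n_experts</code> router cost is my own back-of-envelope estimate, not a number from the paper):</p>
<pre><code class="language-python"># Active-FLOP ratio under the smoke-run widths (d_ff=768, 8 experts -> 96 each).
B, T, d_model = 16, 96, 192          # batch, seq len, width from the setup below
d_ff, n_experts, k = 768, 8, 2
d_ff_expert = d_ff // n_experts      # 96

dense_ffn = 6 * B * T * d_model * d_ff
moe_active_ffn = 6 * B * T * d_model * d_ff_expert * k
router = 2 * B * T * d_model * n_experts   # rough cost of the router matmul

print(moe_active_ffn / dense_ffn)          # 0.25
print(router / dense_ffn)                  # ~0.0035 -> router overhead is tiny
</code></pre>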
<p>What is interesting is where the practical complexity moves. Dense models spend most complexity in matrix sizes. Sparse MoE models spend it in routing stability, token dispatch, and kernel/runtime behavior.</p>
<h3>The Non-Obvious Part</h3>
<p>The hard part is not top-2 math. The hard part is determining whether routing is actually specializing or just staying near-uniform.</p>
<p>In this smoke run, final entropy is 2.0276 (no-aux) and 2.0596 (with-aux), while <strong>ln(8) = 2.0794</strong>. That is 97.51% and 99.04% of maximum entropy, so experts are not strongly specialized yet at this scale. Aux mainly nudges routing slightly flatter rather than creating sharp specialization.</p>
<p>That is why I track three signals together: validation perplexity, top1_share, and entropy. Perplexity alone misses routing behavior. At small scale, the bigger risk signal is early-phase routing noise and mild imbalance, not a confirmed late-stage collapse event.</p>
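<p>A minimal sketch of the two routing signals, computed from a (tokens, N) matrix of router probabilities. These are my working definitions: entropy in nats, top1_share as the busiest expert's fraction of top-1 traffic (uniform routing over 8 experts gives entropy ln(8) ≈ 2.0794 and top1_share 0.125):</p>
<pre><code class="language-python">import torch

def routing_stats(probs):
    """Mean per-token router entropy (nats) and busiest-expert top-1 share
    from a (tokens, N) matrix of router probabilities."""
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    counts = torch.bincount(probs.argmax(-1), minlength=probs.shape[-1])
    top1_share = counts.float().max() / probs.shape[0]
    return entropy.item(), top1_share.item()
</code></pre>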
<h2>Reproduction</h2>
<p>I implemented two runnable artifacts in pure PyTorch:</p>
<ul>
<li><p><code>mixtral_isoflop_race.py</code>: dense vs MoE race at equal cumulative FLOP budget.</p>
</li>
<li><p><code>mixtral_router_balance.py</code>: no-aux vs with-aux routing behavior.</p>
</li>
</ul>
<p>Shared components live in <code>moe_core.py</code>.</p>
<p>Hardware baseline: RTX 3090, 24GB VRAM. The scripts include explicit VRAM checks and CPU fallback.</p>
<h3>Experiment Setup</h3>
<p>For the smoke run used here:</p>
<ul>
<li><p>Sequence length: 96</p>
</li>
<li><p>Batch size: 16</p>
</li>
<li><p>Model width: 192</p>
</li>
<li><p>Layers: 3</p>
</li>
<li><p>Experts: 8, top-k: 2</p>
</li>
<li><p>Dense FFN width: 768, per-expert width: 96</p>
</li>
</ul>
<p>I logged checkpointed metrics to:</p>
<ul>
<li><p><code>outputs/isoflop_race_metrics.csv</code></p>
</li>
<li><p><code>outputs/router_balance_metrics.csv</code></p>
</li>
</ul>
<h3>Result 1: FLOP-Matched Dense vs MoE</h3>
<p>From the final checkpoints in <code>isoflop_race_metrics.csv</code>:</p>
<pre><code class="language-text">Dense final: flops=207,920,037,888 | val_ppl=25.307996
MoE final  : flops=204,918,681,600 | val_ppl=20.983645
Delta      : 4.324351 perplexity (17.09%) in favor of MoE
</code></pre>
<p>The MoE endpoint is at ~1.44% lower cumulative FLOPs than dense in this smoke run. So this is not a mathematically perfect iso-point, but it is close enough for a directionally valid quality-per-compute comparison.</p>
<p>But this run also exposes a confounder that has to be disclosed:</p>
<pre><code class="language-text">Dense endpoint: step=12 | tokens_seen=18,432 | wall_seconds=1.719
MoE endpoint  : step=25 | tokens_seen=38,400 | wall_seconds=55.906
</code></pre>
<p>At equal FLOP budget, MoE consumed 108.33% more optimizer steps (and token batches) because each MoE step is cheaper. Wall-clock was 32.52x slower in this implementation. So this result is compute-normalized, but not update-count-normalized or wall-time-normalized.</p>
<p>The checkpoint trajectory matters too: MoE starts slightly behind at the lowest budget checkpoints, then overtakes and stays ahead through the endpoint.</p>
<img alt="Validation perplexity over cumulative FLOPs for dense vs MoE" style="display:block;margin:0 auto" />

<h3>Result 2: Router Balance With and Without Aux Loss</h3>
<p>From <code>router_balance_metrics.csv</code> at step 300:</p>
<pre><code class="language-text">no_aux   : entropy=2.027621 | top1_share=0.182227 | val_ppl=10.259389
with_aux : entropy=2.059556 | top1_share=0.175938 | val_ppl=10.315585
</code></pre>
<p>Derived deltas:</p>
<ul>
<li><p>Top-1 concentration drops by 0.006289 absolute, a 3.45% relative reduction.</p>
</li>
<li><p>Entropy increases by 0.031935.</p>
</li>
<li><p>Validation perplexity is slightly worse with aux in this run (difference 0.056196).</p>
</li>
</ul>
<img alt="Router entropy over training steps with and without auxiliary load balancing" style="display:block;margin:0 auto" />

<p>At this scale, both runs remain near-uniform (entropy close to <strong>ln(8)</strong>), so this is not strong expert specialization yet. Aux still improves balance modestly (lower top1_share and higher entropy), but that regularization did not translate into a perplexity win in this specific run.</p>
<p>So the practical win from aux here is routing smoothness, not immediate quality lift.</p>
<p>I want these caveats explicit:</p>
<ul>
<li><p>This is a smoke-scale run, not full-scale pretraining.</p>
</li>
<li><p>Tokenizer is character-level and corpus is small compared to production pretraining mixes.</p>
</li>
<li><p>Single-seed evidence. Robustness claims need multi-seed and larger compute budgets.</p>
</li>
</ul>
<p>Even with those caveats, the mechanism-level behavior is clear and reproducible.</p>
<p>I like to include one small raw-style checkpoint block because it forces honesty. It is easy to tell a clean story with only endpoints. It is harder when intermediate points are visible:</p>
<pre><code class="language-text">[step 250] no_aux:   val_ppl=10.4865 | top1_share=0.1769 | entropy=2.0358
[step 275] no_aux:   val_ppl=10.2653 | top1_share=0.1785 | entropy=2.0278
[step 300] no_aux:   val_ppl=10.2594 | top1_share=0.1822 | entropy=2.0276

[step 250] with_aux: val_ppl=10.4874 | top1_share=0.1765 | entropy=2.0637
[step 275] with_aux: val_ppl=10.2944 | top1_share=0.1773 | entropy=2.0609
[step 300] with_aux: val_ppl=10.3156 | top1_share=0.1759 | entropy=2.0596
</code></pre>
<p>What stands out is that with-aux keeps entropy higher, but both runs still sit close to uniform routing and with-aux does not improve validation perplexity in this smoke run. For production teams, this is a reminder to treat aux as a routing-control tool and validate quality impact separately.</p>
<h2>Production Reality</h2>
<p>As of March 2026, MoE is no longer a paper-only architecture. It is a deployment pattern. The stack has matured, but it has a specific shape that matters for engineering decisions.</p>
<h3>Where it runs</h3>
<p>From Mistral's release notes and serving ecosystem docs:</p>
<ul>
<li><p>Mixtral introduced 46.7B total and 12.9B active parameters per token using top-2 routing.</p>
</li>
<li><p>Mistral explicitly pushed open-source deployment through vLLM integration with MegaBlocks kernels.</p>
</li>
<li><p>vLLM now exposes OpenAI-compatible APIs and includes explicit expert-parallel and routed-expert deployment paths in docs and examples.</p>
</li>
</ul>
<p>So in practice, teams deploy MoE models behind the same API contracts as dense models, but with a very different runtime under the hood.</p>
<h3>What changed since the paper</h3>
<p>The highest-impact shift is kernel and runtime specialization.</p>
<p>In the paper view, MoE looks like a clean routing equation. In production, throughput depends on runtime handling of uneven token-to-expert assignments. This is why fused MoE kernels, dispatch optimizations, and expert-parallel communication strategies became first-class concerns in serving systems.</p>
<p>Second, deployment topology became part of model quality engineering.</p>
<p>Top-2 routing quality is not enough. Routing traffic also has to map cleanly to multi-GPU or multi-node topology. When prompt distributions create hot experts, latency tails widen even when average throughput still looks good.</p>
<p>Third, API compatibility got easier, model formatting got stricter.</p>
<p>OpenAI-compatible serving lowered integration friction. But real deployments still trip on tokenizer and chat-template mismatches, generation defaults, and request-shape differences across model families.</p>
<h3>The production gotcha</h3>
<p>The biggest MoE gotcha is memory versus active compute.</p>
<p>Active compute can look like a much smaller dense model, but total expert weights still have to be resident for efficient serving. Compute savings do not remove memory pressure. If capacity planning ignores that, deployments either under-provision VRAM or pay network penalties from aggressive weight sharding/offload.</p>
<p>My recommendation is to treat MoE rollout as two separate design problems:</p>
<ol>
<li><p>Quality and compute economics at model level.</p>
</li>
<li><p>Routing traffic engineering at runtime level.</p>
</li>
</ol>
<p>Teams that only solve the first one get surprised in production.</p>
<h3>What I instrument before calling an MoE deployment healthy</h3>
<p>For dense deployments, teams usually track latency, throughput, and error rate. For MoE, those are necessary but not sufficient. I add routing-aware telemetry to every serving stack:</p>
<ul>
<li><p>Per-expert token load over time windows.</p>
</li>
<li><p>Top-1 share and entropy by routeable layer.</p>
</li>
<li><p>P50/P95/P99 latency split by prompt length bucket.</p>
</li>
<li><p>Cross-device transfer volume for expert dispatch.</p>
</li>
</ul>
<p>This catches the most common silent failure: average latency looks fine, but one or two experts go hot and drive tail latency regressions. Under aggregate-only throughput monitoring, that failure mode can sit in production for weeks.</p>
<h3>Rollout pattern that actually works</h3>
<p>The safest MoE rollout I have seen is staged:</p>
<ol>
<li><p>Shadow traffic with full routing metrics enabled.</p>
</li>
<li><p>Limited canary where latency SLOs are evaluated on tail, not mean.</p>
</li>
<li><p>Full rollout only after expert load distribution is stable across daily traffic cycles.</p>
</li>
</ol>
<p>Dense-to-MoE migration fails when teams treat it like a drop-in model swap. It is a model swap plus a routing system rollout. If either side is weak, the launch degrades quickly.</p>
<h2>The Code</h2>
<p>Code and outputs are in this folder <a href="https://github.com/yash-61016/mixtral-of-experts"><code>mixtral-of-experts</code></a>:</p>
<ul>
<li><p><code>mixtral_isoflop_race.py</code>: FLOP-budget race between dense and MoE models.</p>
</li>
<li><p><code>mixtral_router_balance.py</code>: paired routing runs with and without auxiliary balancing.</p>
</li>
<li><p><code>moe_core.py</code>: shared model blocks, router loss, FLOP estimators, and diagnostics.</p>
</li>
</ul>
<p>Run with an RTX 3090 (24GB VRAM) for the intended experience. Smoke commands are in <code>README.md</code> and finish in minutes. Longer runs need a larger FLOP budget and step count to tighten confidence on the quality deltas.</p>
]]></content:encoded></item><item><title><![CDATA[LLaMA 2: How Three Borrowed Techniques Fit a 70B Model on Two GPUs]]></title><description><![CDATA[The Memory Problem
Serving 10 concurrent users with a 70B-scale model at 4K context, using the vanilla transformer architecture from 2017, requires roughly 240GB of GPU memory: about 140GB for weights]]></description><link>https://blogs.yashpatel.xyz/llama-2-how-three-borrowed-techniques-fit-a-70b-model-on-two-gpus</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/llama-2-how-three-borrowed-techniques-fit-a-70b-model-on-two-gpus</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 15 Mar 2026 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/f0d70439-cf5f-4a68-9251-1c21e30eb82d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Memory Problem</h2>
<p>Serving 10 concurrent users with a 70B-scale model at 4K context, using the vanilla transformer architecture from 2017, requires roughly 240GB of GPU memory: about 140GB for weights and about 100GB for the KV cache (the K and V tensors stored per token to avoid recomputing attention from scratch on each decode step). Two A100-80GBs give you 160GB. The math doesn't work.</p>
<p>Three techniques separate every modern open-weight model from the vanilla 2017 transformer: RoPE, RMSNorm, and GQA. None of them originated in LLaMA 2. RoPE came from Su et al. (2021). RMSNorm came from Zhang and Sennrich (2019). GQA came from Ainslie et al. (2023). What the LLaMA lineage did was consolidate them into a single coherent architecture that the entire open-source ecosystem built around. LLaMA 1 brought RoPE and RMSNorm into wide use. LLaMA 2 added GQA for its larger models (specifically the 34B and 70B) and doubled the context window to 4K. The 7B and 13B still use standard multi-head attention.</p>
<p>That is why every open-weight model released since mid-2023 shares the same config.json backbone: <code>num_key_value_heads</code>, <code>rope_theta</code>, <code>rms_norm_eps</code>. Not because LLaMA 2 invented them. Because LLaMA 2 was the first widely-adopted open model to ship all three together at scale.</p>
<p><a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">Post 1</a> covered the attention mechanism: QKV projections, scaled dot product, causal mask. This post covers what those three techniques actually do and why the combination works. Together they explain why <code>num_key_value_heads</code> differs from <code>num_attention_heads</code> in most modern decoder configs, what raising <code>rope_theta</code> from 10,000 to 500,000 actually does to context length, and why the open-weight ecosystem almost universally abandoned <code>LayerNorm</code> after 2022.</p>
<hr />
<h2>The Paper's Claim</h2>
<p>By mid-2023, the largest open-weight model was LLaMA 1 65B: competitive on several benchmarks, trained on 1.4T tokens, capped at 2K context, and requiring eight A100s for practical serving. Touvron et al. argued that three existing techniques, none original to Meta, could close most of the efficiency and quality gap: RoPE for position encoding that generalises to longer sequences, RMSNorm for stable training at scale, and GQA to cut the KV cache footprint by up to 8x at inference. LLaMA 2 70B, trained on 2T tokens with these three architectural changes, scores 68.9 on MMLU 5-shot, up 5.5 points from LLaMA 1 65B's 63.4 and within 1.1 points of GPT-3.5 at 70.0. GQA specifically brings the 70B's KV cache from around 100GB to around 12.5GB at a 10-user load, the number that makes two-GPU deployment realistic at that scale.</p>
<hr />
<h2>RoPE, RMSNorm, and GQA</h2>
<p>A LLaMA 2 70B forward pass is the same transformer Vaswani et al. described in 2017 at the system level. Token embeddings enter, 80 identical decoder blocks process them via residual streams, and a final linear projection produces logits over the vocabulary. The causal mask, the QKV attention sub-layer, the feed-forward sub-layer, the skip connections: all unchanged. Three specific components inside each decoder block differ from the 2017 original: how position is injected into queries and keys before the dot product (RoPE, replacing sinusoidal encodings), how each sub-layer's inputs are stabilised before the linear projections (RMSNorm, replacing LayerNorm), and how key-value tensors are distributed across the 64 attention heads (GQA with 8 KV heads, replacing full MHA). Those three changes are the entire architectural delta from the vanilla transformer.</p>
<p>Three separate failure modes, three separate fixes. RoPE (Su et al. 2021) fixes positional encoding's length generalisation problem. RMSNorm (Zhang and Sennrich 2019) removes a GPU synchronisation barrier from every forward pass. GQA (Ainslie et al. 2023) cuts the KV cache footprint by up to 8x without touching the attention math.</p>
<p>Take the token "model" at position 42 in a 4,096-token input, processed by the first decoder block of LLaMA 2 70B (d_model=8192, 64 query heads, 8 KV heads, head dimension d_k=128). That single token passes through all three modified components before any output is produced. Here is what each one does.</p>
<h3>RoPE: Position as Rotation</h3>
<p>The original transformer encodes position by adding a sinusoidal vector to each token's embedding before the first attention layer. Expand the dot product between query q at position m and key k at position n:</p>
<p>$$(\mathbf{q} + \text{PE}[m]) \cdot (\mathbf{k} + \text{PE}[n]) = \mathbf{q} \cdot \mathbf{k} + \mathbf{q} \cdot \text{PE}[n] + \text{PE}[m] \cdot \mathbf{k} + \text{PE}[m] \cdot \text{PE}[n]$$</p>
<p>The cross terms <code>q·PE[n]</code> and <code>PE[m]·k</code> depend on absolute positions m and n individually. There is no factorization that reduces this to a function of relative offset (m - n) alone. A model trained at 2K tokens has no algebraic guarantee that the positional signal at positions 500 and 510 transfers to positions 5,500 and 5,510. The model learns an approximation from data. It fails cleanly beyond training length.</p>
<p>Rotary Position Embedding (RoPE), introduced by Su et al. in "RoFormer" (2021) and adopted by LLaMA 1, applies rotations to Q and K inside the attention operation, after the linear projections but before the dot product. Each consecutive dimension pair (x[2k], x[2k+1]) is treated as a 2D point and rotated by an angle proportional to position. The rotation angles <code>θₖ = 10000^(−2k/d)</code> are fixed, not learned. For a token at position m, dimension pair k receives rotation m·θₖ:</p>
<p>$$x_{\text{rot}}[2k] = x[2k]\cos(m\theta_k) - x[2k+1]\sin(m\theta_k)$$</p>
<p>$$x_{\text{rot}}[2k+1] = x[2k]\sin(m\theta_k) + x[2k+1]\cos(m\theta_k)$$</p>
<p>Rotation is isometric: the norm of the vector is preserved. Additive positional encodings inflate vector norms by the PE magnitude, shifting softmax distributions in proportion to position rather than content. Rotation keeps attention logit scale unchanged.</p>
<p>What's interesting is what happens to the dot product after applying this rotation to both Q and K: the result depends only on content and relative distance (m - n). The algebraic proof is clean. The consequence is meaningful: a model trained at 4K context sees the same rotational relationship between tokens 100 and 200 as between tokens 4100 and 4200. This is why context extension works by scaling the base frequency schedule rather than full retraining.</p>
<p>Concrete numbers using LLaMA 2's head dimension (dₖ = 128, so 64 dimension pairs). The token "model" at position 42 receives rotation 42 × θ₀ = 42.0 radians on its fastest dimension pair (k = 0, completing about 6.7 full cycles) and 42 × θ₆₃ ≈ 0.0049 radians on its slowest (k = 63, barely a nudge). The 8,600x frequency spread across 64 pairs is the mechanism: fast-rotating pairs distinguish nearby tokens within a few positions; slow-rotating pairs retain signal across hundreds of tokens. Both happen simultaneously, in different dimensions of the same 128-dim head.</p>
<p>One non-obvious consequence: position 0 always receives identity rotation (m=0 makes all angles zero). The first token in every sequence, typically a BOS token, carries zero positional information from RoPE. Its attention patterns are driven entirely by content geometry. This explains why layer-0 attention heads often appear to attend uniformly across the sequence: there is no positional bias at the sequence start, only content similarity.</p>
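<p>A minimal NumPy sketch of the rotation above, plus a direct check of the relative-distance property (the positions and seed are arbitrary):</p>
<pre><code class="language-python">import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x (shape (d,)) by pos * theta_k."""
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # fixed theta_k per pair
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(128), rng.standard_normal(128)
# Dot product depends only on the offset: (100, 200) vs (4100, 4200).
a = rope(q, 100) @ rope(k, 200)
b = rope(q, 4100) @ rope(k, 4200)
print(abs(a - b))   # tiny: floating-point rounding only
</code></pre>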
<img alt="RoPE rotation angles heatmap" style="display:block;margin:0 auto" />

<p><em>Rotation angles freqs[position, k] mod 2π for 64 positions and 64 dimension pairs. Low-index pairs (bottom) complete multiple rotations within a short span. High-index pairs (top) are nearly flat. This frequency spread is what makes RoPE's locality bias multi-scale.</em></p>
<h3>RMSNorm: Removing the Synchronisation Barrier</h3>
<p>LayerNorm normalizes each activation vector in two sequential passes: subtract the mean (re-centering), then divide by the standard deviation (re-scaling). On GPU, the mean computation requires a barrier synchronisation across the full activation tensor before the variance step can begin. At large hidden dimensions and batch sizes, this barrier is a measurable throughput cost.</p>
<p>Root Mean Square Normalization (RMSNorm) drops re-centering entirely. Normalization is magnitude-only:</p>
<p>$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} \cdot \gamma, \qquad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{n}\sum_i x_i^2 + \varepsilon}$$</p>
<p>One pass. One learned scale γ per dimension. No mean subtraction, no learned shift parameter β. For the token "model" at position 42, this means the 8,192-dimensional activation vector entering each sub-layer is divided by its own RMS magnitude before the Q/K/V projections run. One scalar division, no synchronisation point. Zhang and Sennrich (2019) report 7-64% speedup over LayerNorm depending on architecture; for transformer blocks the gain is in the 7-9% range. At 80 layers across millions of training tokens, that compounds.</p>
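<p>The whole operation is a few lines of NumPy (a single-vector sketch; real implementations normalise over the last axis of a batched activation tensor):</p>
<pre><code class="language-python">import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Magnitude-only normalisation: no mean subtraction, no learned shift."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x / rms) * gamma

x = np.random.default_rng(1).standard_normal(8192)   # one 8,192-dim activation
y = rms_norm(x, gamma=np.ones(8192))
print(np.sqrt(np.mean(y * y)))   # ~1.0: output has unit RMS magnitude
</code></pre>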
<p>LLaMA 1 adopted RMSNorm as pre-norm: before each sub-layer, not after. The original transformer used post-norm. At 70B+ scale, post-norm produces unnormalized residuals that accumulate through the stack and reach the loss before any stabilization. Training diverges early. GPT-3 adopted pre-norm and documented the result. Every large model trained since followed.</p>
<h3>GQA: Asymmetric Caching</h3>
<p>Every autoregressive decode step computes Q, K, and V for the new token, appends K and V to the cache, then reads the entire cached K and V to compute attention over the full context. The query is used once and discarded. Keys and values persist for the full lifetime of the sequence.</p>
<p>Grouped Query Attention (GQA), introduced by Ainslie et al. (2023), is built around this asymmetry. Query heads are never cached, so the number of query heads has zero effect on KV cache size. Only KV heads consume persistent memory. LLaMA 2 adopted GQA for its 34B and 70B models; the 7B and 13B use standard MHA. The 70B uses 64 query heads and 8 KV heads: every group of 8 query heads shares the same K and V projections. The token "model" at position 42 has its K and V projections written to one of those 8 KV slots. All 8 query heads in the same group attend over that single K/V entry. The attention computation inside each group is unchanged. Quality loss is negligible: Ainslie et al. report ROUGE-1 degradation within 0.3 points on any individual dataset and 0.1 points on average at T5 XXL scale.</p>
<p>The formula makes the savings concrete:</p>
<p>$$\text{KV bytes} = 2 \times \text{batch} \times \text{layers} \times \text{seq_len} \times \text{num_kv_heads} \times \text{head_dim} \times \text{dtype_bytes}$$</p>
<p>At seq=4096, batch=10, float16:</p>
<pre><code class="language-plaintext">Full MHA (64 KV heads):  2 × 10 × 80 × 4096 × 64 × 128 × 2  ≈  100 GB
GQA     ( 8 KV heads):  2 × 10 × 80 × 4096 ×  8 × 128 × 2  ≈  12.5 GB
</code></pre>
<p>The reason this matters at inference is memory bandwidth, not just total capacity. Each decode step reads the entire KV cache from DRAM. An A100 delivers 312 TFLOP/s of compute but only 2 TB/s of memory bandwidth. The compute units wait on the memory bus. GQA's 8x reduction in KV heads translates directly to 8x fewer bytes transferred per decode step. Latency at decode time tracks this reduction closely, since memory bandwidth is the binding constraint at the 70B scale.</p>
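<p>The formula as a function, reproducing both numbers (GiB here, matching the arithmetic above):</p>
<pre><code class="language-python">def kv_cache_bytes(batch, layers, seq_len, num_kv_heads, head_dim, dtype_bytes=2):
    """2x for K and V; every other factor multiplies linearly."""
    return 2 * batch * layers * seq_len * num_kv_heads * head_dim * dtype_bytes

mha = kv_cache_bytes(10, 80, 4096, 64, 128)   # LLaMA 2 70B shape with full MHA
gqa = kv_cache_bytes(10, 80, 4096, 8, 128)    # the shipped 8-KV-head config
print(mha / 2**30, gqa / 2**30)               # 100.0 GiB vs 12.5 GiB
print(mha / gqa)                              # 8.0
</code></pre>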
<p>The connection back to <a href="https://yashpatel.xyz/blog/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer">post 1</a>: Flash Attention solves the O(n²) score matrix problem by rewriting the attention kernel. GQA solves a different problem. Even with Flash Attention eliminating the materialized score matrix, the KV cache still grows linearly with sequence length and linearly with num_kv_heads. Flash Attention does not touch the cache. GQA removes the second multiplier.</p>
<hr />
<h2>What I Built and What I Found</h2>
<p><strong>Artifact 1</strong> implements RoPE in 200 lines of pure NumPy, with no PyTorch or HuggingFace dependency. Hardware: CPU-only, runtime around 45 seconds. The verification strategy uses the complex-number representation: <code>z_rot = z * exp(i * theta)</code> is mathematically equivalent to the real-valued 2D rotation matrix and serves as ground truth.</p>
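<p>The relative-distance property the artifact verifies fits in a few lines. A sketch (not the artifact itself) of the same complex-number formulation: rotating q by position m and k by position n, their dot product depends only on m - n:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head_dim, must be even
theta = 10000.0 ** (-np.arange(d // 2) * 2 / d)

def rope(x, pos):
    # rotate consecutive (even, odd) pairs of x by pos * theta via complex multiply
    z = x[0::2] + 1j * x[1::2]
    z = z * np.exp(1j * pos * theta)
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

q, k = rng.standard_normal(d), rng.standard_normal(d)
a = rope(q, 5) @ rope(k, 3)        # positions (5, 3), relative distance 2
b = rope(q, 105) @ rope(k, 103)    # positions (105, 103), relative distance 2
assert abs(a - b) < 1e-9           # same distance, same score, up to rounding
```

Shifting both positions by 100 leaves the score unchanged, which is exactly the property the artifact proves over 20 random pairs.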
<pre><code class="language-plaintext">[Verification vs. complex reference]
  Max |q_complex - q_ours| : 0.00e+00
  Max |k_complex - k_ours| : 0.00e+00
  Expected: &lt; 1e-14  (exact same computation, different notation)
</code></pre>
<p>Zero error. That is not an approximation matching a reference; it is two representations of the same geometric operation. The numerical proof of the relative-distance property over 20 random token pairs at arbitrary positions gives max error less than 1e-12. Floating-point rounding. The property is exact.</p>
<p>What the paper doesn't mention: the attention decay curve for sinusoidal PE is not monotonic. I measured E[|q · k|] as a function of relative distance d, averaging over 200 random unit-norm 128-dimensional vector pairs. RoPE decays from ~0.07 at d=0 to ~0.01 at d=127. Clean. Sinusoidal encoding behaves differently: pe[0] = [0,1,0,1,...] has norm \(\sqrt{64} \approx 8\), which completely dominates unit-norm content vectors. The sinusoidal signal at d=0 sits near 1.0, overwhelmingly driven by position rather than content, then oscillates with no consistent directional trend across the 128-token window.</p>
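<p>The pe[0] norm claim is easy to check directly. A quick sketch, assuming the standard interleaved sin/cos layout at d=128:</p>

```python
import numpy as np

d = 128
i = np.arange(d // 2)
freqs = 1.0 / (10000 ** (2 * i / d))
pe0 = np.empty(d)
pe0[0::2] = np.sin(0 * freqs)  # all zeros at position 0
pe0[1::2] = np.cos(0 * freqs)  # all ones at position 0
norm = np.linalg.norm(pe0)
print(norm)  # 8.0 == sqrt(64), dwarfing unit-norm content vectors
```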
<img alt="RoPE vs sinusoidal attention decay" style="display:block;margin:0 auto" />

<p><em>E[|q · k|] vs relative distance. RoPE (blue) decays monotonically from ~0.07. Sinusoidal (orange) starts near 1.0 because pe[0] dominates unit-norm content vectors, then shows no consistent directional decay. Log scale on the right confirms RoPE's clean monotonic tail.</em></p>
<p>A model trained with sinusoidal PE must learn locality bias entirely from data. RoPE provides it geometrically. That distinction matters most when extrapolating to sequence lengths not seen during training.</p>
<p><strong>Artifact 2</strong> allocates real PyTorch KV cache tensors on an RTX 3090 and measures <code>torch.cuda.memory_allocated()</code> directly. No model weights loaded. The benchmark sweeps four architectures across sequence lengths 128-4096 and computes maximum batch size under a 20GB VRAM budget.</p>
<p>Results at seq=4096, batch=1:</p>
<pre><code class="language-plaintext">  GPT-2 XL  (MHA, 25 KV heads, 48 layers)  : 1.172 GiB
  LLaMA-2 7B (MHA, 32 KV heads, 32 layers)  : 2.000 GiB
  LLaMA-2 70B (GQA, 8 KV heads, 80 layers)  : 1.250 GiB
  MQA variant  (1 KV head,  80 layers)       : 0.156 GiB

  70B hypothetical MHA (64 KV heads):        10.000 GiB
  Actual 70B GQA  (8 KV heads):               1.250 GiB  (8.0x reduction)
</code></pre>
<p>Maximum batch under 20GB budget at seq=4096:</p>
<pre><code class="language-plaintext">  GPT-2 XL     (MHA):  max_batch = 17
  LLaMA-2 7B   (MHA):  max_batch = 10
  LLaMA-2 70B  (GQA):  max_batch = 16
  70B hypothetical MHA: max_batch =  2
</code></pre>
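<p>The batch numbers fall straight out of the per-sequence cache cost. A sketch reproducing the table analytically, using the head and layer counts listed above:</p>

```python
def kv_bytes_per_seq(layers, kv_heads, head_dim, seq_len=4096, dtype_bytes=2):
    # KV cache cost of one sequence; 2x for the K and V tensors
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

budget = 20 * 2**30  # 20 GiB KV cache budget
configs = {
    "GPT-2 XL (MHA)":       (48, 25, 64),
    "LLaMA-2 7B (MHA)":     (32, 32, 128),
    "LLaMA-2 70B (GQA)":    (80, 8, 128),
    "70B hypothetical MHA": (80, 64, 128),
}
max_batch = {name: budget // kv_bytes_per_seq(*cfg) for name, cfg in configs.items()}
print(max_batch)
```

The analytical numbers match the measured ones because, at seq=4096, allocator rounding is negligible.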
<p>The 70B GQA model serves 16 concurrent sequences within 20GB. Without GQA, the same model serves 2. The 7B MHA model serves 10 sequences, fewer than the 70B GQA, because LLaMA 2 7B's 32 KV heads at 32 layers produce a larger per-token KV footprint than the 70B's 8 KV heads at 80 layers. Counter-intuitive: model parameter count is not the right proxy for KV memory footprint at inference. Head count and layer count are.</p>
<img alt="KV cache memory vs sequence length" style="display:block;margin:0 auto" />

<p><em>KV cache GiB at batch=1 across four architectures. The 70B GQA line (1.250 GiB at seq=4096) stays below the 7B MHA line (2.000 GiB) even though the model is 10x larger: 8 KV heads × 80 layers beats 32 KV heads × 32 layers.</em></p>
<img alt="Max batch under 20GB budget" style="display:block;margin:0 auto" />

<p><em>Maximum concurrent sequences under a 20GB KV cache budget at seq=4096. MQA reaches 120+ at short contexts. The 70B GQA (16 sequences) outperforms the 7B MHA (10 sequences), and gives 8x more capacity than hypothetical 70B MHA (2 sequences).</em></p>
<p>I found that measured memory sits slightly above the analytical formula for small tensors. At seq=128 for the GPT-2 XL configuration, the empirical-to-theoretical ratio is approximately 1.010. The CUDA allocator rounds allocations to internal block boundaries. At seq=4096 the ratio rounds to 1.000. The GQA paper reports memory savings using the analytical formula only. At production sequence lengths the formula is exact; at short contexts it overstates savings by 1-3%.</p>
<hr />
<h2>Two Years Later</h2>
<p><strong>Where it runs:</strong> LLaMA 3 (8B, 70B, 405B), Mistral 7B and its derivatives, Gemma 2 (9B, 27B), Qwen 2.5, Phi-3 Medium. Every <code>modeling_llama.py</code> in HuggingFace implements all three mechanisms. vLLM, TGI, and TensorRT-LLM build their attention kernels around the GQA layout (from Ainslie et al. 2023) and RoPE rotations (from Su et al. 2021).</p>
<p><strong>What's changed:</strong></p>
<p>Context length is the dimension that has moved furthest. LLaMA 1 targeted 2K, LLaMA 2 doubled to 4K. LLaMA 3.1 extended to 128K via additional long-context training with one specific lever: <code>rope_theta</code>. LLaMA 2 uses <code>rope_theta = 10,000</code>, the default from the original RoPE paper (Su et al. 2021). LLaMA 3 bumped it to 500,000. Raising <code>rope_theta</code> slows every dimension pair's rotation, so the slow-rotating pairs retain meaningful discrimination capacity at much greater token distances. The 50x increase in <code>rope_theta</code> roughly tracks the 32x jump in effective context from 4K to 128K. YaRN (2023) and the LLaMA 3 technical report both identify <code>rope_theta</code> scaling as the primary lever for context extension without retraining from scratch.</p>
<p>GQA is effectively the default for open-weight decoder models released since 2023: Mistral 7B, LLaMA 3 (all sizes), Qwen 2.5, and Phi-3 all ship it. The earlier debate between GQA and MQA (single shared KV head) settled in favor of GQA: MQA is measurably worse above 34B parameters, where the single key-value representation bottlenecks model capacity.</p>
<p>In decoder-only models at 7B+ scale, RMSNorm has effectively replaced LayerNorm across every major open-weight release since 2023.</p>
<p><strong>The production gotcha:</strong> Model weights are a fixed one-time memory cost. The KV cache is not. It grows with every token generated, every concurrent user, every active session. At 100 concurrent users each holding a conversation at 8K context, a 70B GQA model needs:</p>
<pre><code class="language-plaintext">2 × 100 × 80 × 8192 × 8 × 128 × 2  ≈  268 GB of KV cache
</code></pre>
<p>That is roughly 1.9x the model weights. The model fits on two A100s. The conversations do not. This is exactly why vLLM implements PagedAttention and why KV cache eviction exists in every production serving stack. GQA makes the per-token cost 8x smaller. At high concurrency and long context, the cache still dominates. LLaMA 2 identifies the mechanism; the serving systems covered later in this series are built around managing it.</p>
<hr />
<h2>The Code</h2>
<p>Both artifacts are in <a href="https://github.com/yash-61016/llama-2"><code>llama-2/</code></a>.</p>
<p><strong>Artifact 1</strong> (<code>rope_from_scratch.py</code>): RoPE in pure NumPy, no PyTorch. Proves the relative-distance property over 20 random position pairs, verifies against a complex-number reference (max error 0.00e+00), and generates the decay curve comparison against sinusoidal PE and the rotation angle heatmap. Hardware: CPU-only. Runtime: ~45 seconds.</p>
<p><strong>Artifact 2</strong> (<code>gqa_kvcache_benchmark.py</code>): Allocates real KV cache tensors on GPU for four architecture configurations, measures memory with <code>torch.cuda.memory_allocated()</code>, sweeps seq 128-4096, computes max batch under a 20GB budget, and generates the memory and batch capacity charts. Hardware: RTX 3090 (24GB VRAM), CUDA required. Runtime: ~25 seconds.</p>
<p>Dependencies: PyTorch 2.1.0, NumPy 1.24.3, Matplotlib 3.7.1.</p>
]]></content:encoded></item><item><title><![CDATA[Attention Is All You Need: What the Paper's Heads Are Actually Doing at Each Layer]]></title><description><![CDATA[Every production LLM you interact with today, LLaMA 3, Mistral, Gemma, Claude, runs on multi-head attention as its core computation. The paper that introduced it, "Attention Is All You Need" (Vaswani ]]></description><link>https://blogs.yashpatel.xyz/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/attention-is-all-you-need-what-the-paper-s-heads-are-actually-doing-at-each-layer</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[transformers]]></category><category><![CDATA[llm]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Sun, 08 Mar 2026 09:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f79bef975f099bf6d45d0b/270843ed-ac11-4d1d-826f-48bf01ee2c4d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every production LLM you interact with today, LLaMA 3, Mistral, Gemma, Claude, runs on multi-head attention as its core computation. The paper that introduced it, "Attention Is All You Need" (Vaswani et al., 2017), proved the mechanism works and observed in passing that different heads appear to learn different things. That observation is in the paper. The measurement is not.</p>
<p>The paper never quantifies what happens to head specificity as you go deeper into the network. No entropy measurement, no gradient from early to late layers, no empirical signal about how the network organises itself internally across depth. It shows a few hand-selected attention visualizations and moves on. Nine years later, every fine-tuning guide tells you to "freeze the early layers" without explaining what those layers are actually doing differently from the late ones.</p>
<p>That gap is what this post fills. I measured Shannon entropy per head across all 12 layers and 12 heads of GPT-2 small on 100 varied English sentences. The result is a clean monotonic gradient: early-layer heads attend broadly (mean entropy 1.42 nats), late-layer heads lock onto specific tokens (mean entropy 0.50 nats). The two populations are nearly 3x apart. This is not a subtle effect.</p>
<p>If you are deciding how many layers to freeze during LoRA fine-tuning, or debugging why a model attends to the wrong tokens at inference, understanding this gradient is the starting point. The paper gives you the mechanism. This post gives you the empirical structure inside it.</p>
<hr />
<h2>How Attention Actually Works</h2>
<p>Scaled dot-product attention replaces the sequential state update of an RNN with a direct query over every token in parallel. The core idea in two sentences: for each token, compute a compatibility score against every other token via a dot product, normalize those scores with softmax, and blend the corresponding value vectors by those weights. This produces a context-aware representation of every token in a single matrix multiply: no sequential dependency, no stored hidden state.</p>
<p>The paper formalizes this as:</p>
<p><strong>Attention(Q, K, V) = softmax(QK^T / √d_k) · V</strong></p>
<p><strong>Q, K, V are not the same matrix.</strong> Every token is projected three times with separate learned weights. W_Q produces the vector that gets dot-producted against other tokens' keys to produce compatibility scores. W_K produces the vector that gets matched against other tokens' queries. W_V produces the vector whose weighted blend becomes the output. The same token "bank" produces three different vectors: a Q that scores high against financial-context keys, a K that scores high against tokens querying for nouns, and a V carrying its semantic content. Separating these three roles is what lets different heads specialise: one head's Q-K scoring geometry can learn syntactic adjacency while another's learns semantic relatedness.</p>
<p><strong>The √d_k scaling is not cosmetic.</strong> With d_k=64, a dot product between two random 64-dimensional vectors has expected magnitude ~8, because variance grows linearly with dimension. Without scaling, large values push softmax into its saturated regime: all probability mass collapses onto one token and gradients vanish. Dividing by √64=8 keeps input variance at ~1.</p>
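<p>The variance claim can be sanity-checked empirically. A sketch sampling random Gaussian vectors at d_k=64:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
a = rng.standard_normal((100_000, d_k))
b = rng.standard_normal((100_000, d_k))
dots = np.einsum("ij,ij->i", a, b)            # 100k raw dot products
print(round(dots.std(), 2))                   # ~8.0 == sqrt(d_k)
print(round((dots / np.sqrt(d_k)).std(), 2))  # ~1.0 after the paper's scaling
```

Without the division, softmax inputs with std 8 push most rows into near-one-hot territory before any training has happened.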
<p><strong>Multi-head attention</strong> runs 8 independent attention passes over d_k=64-dimensional subspaces, then concatenates and projects the results. The motivation: a single softmax over 512 dimensions tends to collapse into near-one-hot distributions, losing the ability to track multiple relationships simultaneously. Eight heads over 64 dimensions each costs roughly the same as one head over 512 dimensions. The key insight is that you split before computing attention, not after.</p>
<p>The implementation detail that trips people: the heads are not run in a loop. The input starts as <code>(batch, seq, d_model)</code>. A view and transpose converts it to <code>(batch, heads, seq, d_k)</code>, placing the heads dimension where PyTorch's matmul can treat them as independent batch dimensions. With shape <code>(1, 8, 10, 64)</code>, a single <code>torch.matmul(Q, K.transpose(-2, -1))</code> produces all 8 score matrices simultaneously. The compute graph is identical to 8 separate matrix multiplies, but the batched form maps to a single CUBLAS kernel call.</p>
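<p>The same shape trick can be sketched in NumPy (the artifact uses PyTorch, but matmul broadcasting over leading batch dimensions works identically):</p>

```python
import numpy as np

batch, seq, heads, d_k = 1, 10, 8, 64
x = np.random.randn(batch, seq, heads * d_k)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, d_k): split BEFORE attention
    return t.reshape(batch, seq, heads, d_k).transpose(0, 2, 1, 3)

Q, K = split_heads(x), split_heads(x)
scores = Q @ K.transpose(0, 1, 3, 2)  # one batched matmul -> all 8 score matrices
print(scores.shape)  # (1, 8, 10, 10)
```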
<p>The causal mask for decoder self-attention enforces that token i cannot attend to position j &gt; i. The upper triangle of the score matrix is set to negative infinity before softmax fires. Since exp(-inf) = 0 exactly, future tokens contribute zero weight to the output, and the row sums remain exactly 1 without any additional normalisation step:</p>
<pre><code class="language-plaintext">causal mask, 5 tokens:
token 0: [  s00, -inf, -inf, -inf, -inf ]  ← only sees itself
token 1: [  s10,  s11, -inf, -inf, -inf ]
token 2: [  s20,  s21,  s22, -inf, -inf ]
</code></pre>
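<p>A minimal single-head sketch of the masked attention above (NumPy, illustration only, not the artifact code):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                            # compatibility scores
    scores[np.triu(np.ones((n, n), dtype=bool), 1)] = -np.inf  # mask the future
    weights = softmax(scores)                                  # exp(-inf) = 0 exactly
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 64))
out, w = causal_attention(Q, K, V)
assert w[0, 0] == 1.0                    # token 0 sees only itself
assert np.allclose(np.triu(w, 1), 0.0)   # zero weight on future tokens
assert np.allclose(w.sum(axis=1), 1.0)   # rows still sum to 1, no renormalisation
```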
<p><strong>Positional encoding</strong> patches the mechanism's intrinsic blindness to order. Attention is a set operation: permuting the input tokens produces the same weighted sums, just permuted. Word order is invisible without an explicit signal. The paper injects position by adding sinusoidal vectors to the input embeddings before the first layer. Each pair of dimensions oscillates at a different frequency, with wavelengths forming a geometric progression from 2π positions for the fastest pair to 10,000·2π positions for the slowest. The model can infer absolute position from the joint oscillation pattern across all 512 dimensions. Practically, this was superseded by RoPE in modern LLMs, but the requirement remains: position information must be injected explicitly.</p>
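<p>A sketch of the encoding in the interleaved sin/cos layout (a hedged reconstruction, not the paper's reference code):</p>

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model=512):
    # frequencies form a geometric progression from 1 down to 1/10000 rad/position
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(4096)
print(pe.shape)  # (4096, 512)
# every position gets a distinct fingerprint across the 512 dimensions
assert np.all(np.linalg.norm(pe[1:] - pe[:-1], axis=1) > 0)
```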
<hr />
<h2>What I Found Running It</h2>
<p>I implemented scaled dot-product attention and multi-head attention in 200 lines of pure PyTorch, without using <code>torch.nn.MultiheadAttention</code>. Every intermediate tensor shape is annotated inline. Hardware: RTX 3090 (24GB VRAM). Library versions: torch==2.1.0, transformers==4.38.2.</p>
<p><strong>What matched:</strong> Running my implementation against <code>F.scaled_dot_product_attention</code> on identical inputs with a causal mask gives a max absolute difference of 1.19e-07. The implementations agree to floating-point precision.</p>
<pre><code class="language-plaintext">Max absolute difference vs PyTorch reference : 1.19e-07
PASS  implementation matches reference
</code></pre>
<p><strong>I found that:</strong> the VRAM cost of the score matrix hits numbers that make batching impossible far earlier than intuition suggests. The score matrix Q·K^T has shape <code>(batch × heads × n × n)</code> in float32, meaning 4 bytes × n² entries per layer. I ran forward passes at five sequence lengths on a single-layer toy model on the RTX 3090:</p>
<pre><code class="language-plaintext"> Seq length    Peak VRAM      Theoretical score matrix
──────────────────────────────────────────────────────
          64      22 MB            2.6 MB
         128      24 MB           10.5 MB
         256      33 MB           41.9 MB
         512      68 MB          167.8 MB
        1024     241 MB          671.1 MB
</code></pre>
<img alt="VRAM scaling chart" style="display:block;margin:0 auto" />

<p><em>Peak VRAM measured on RTX 3090 across five sequence lengths. The quadratic fit confirms O(n^2) growth: at n=1024, the score matrix alone reaches 671MB for a single-layer model.</em></p>
<p>At n=1024, the score matrix for a single-layer toy model reaches 671 MB. Scale that to GPT-3's 96 layers and you get the number that made Flash Attention (2022) a necessity, not an optimisation.</p>
<p><strong>What the paper doesn't measure:</strong> Shannon entropy falls monotonically with layer depth. I ran GPT-2 small (117M parameters, 12 layers × 12 heads) over 100 English sentences covering SVO, relative clauses, passive constructions, and coreference. Per head I measured Shannon entropy (how diffuse the attention distribution is) and diagonal score (fraction of attention within ±2 positions of the diagonal, as a proxy for purely positional heads). I classified all 144 heads into four empirical types: local (positional), copy (locks onto one or two tokens), broad (attends widely), and mixed.</p>
<pre><code class="language-plaintext">KEY FINDING: layer-depth entropy gradient
  Early layers (0-3)  mean entropy : 1.421 nats
  Late  layers (8-11) mean entropy : 0.497 nats
  Gradient            (late-early) : -0.924 nats
</code></pre>
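<p>The two per-head metrics are short to implement. A sketch against a row-stochastic attention matrix (the helper names are illustrative, not the artifact's):</p>

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    # Shannon entropy (nats) of each attention row; high = diffuse, low = focused
    w = np.clip(weights, eps, 1.0)
    return -(w * np.log(w)).sum(axis=-1)

def diagonal_score(weights, window=2):
    # fraction of attention mass within +/- window positions of the diagonal
    n = weights.shape[-1]
    near = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= window
    return float((weights * near).sum() / weights.sum())

broad = np.full((8, 8), 1 / 8)   # attends uniformly: entropy = ln(8) ~ 2.08 nats
copy = np.eye(8)                 # locks onto a single token: entropy ~ 0
print(attention_entropy(broad)[0], attention_entropy(copy)[0])
```

Note that the identity matrix scores both low entropy and a perfect diagonal score, which is why the classification must check locality before sharpness.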
<img alt="Head layer depth gradient" style="display:block;margin:0 auto" />

<p><em>Left: mean attention entropy per layer in GPT-2 small across 100 sentences. Right: head type distribution per layer. Early layers are dominated by local and broad heads. Late layers converge on copy and mixed heads.</em></p>
<img alt="Head entropy heatmap" style="display:block;margin:0 auto" />

<p><em>Mean Shannon entropy per head across 12 layers x 12 heads. Light cells are sharp and focused (low entropy). Dark cells are diffuse (high entropy).</em></p>
<p>Early layers are nearly 3x more diffuse than late layers. The paper shows hand-selected attention visualizations and notes that "different heads learn to perform different tasks." It never quantifies what happens to head specificity as depth increases. Layer 0 looks almost uniform; Layer 11 looks like a spiked distribution.</p>
<p>I found that the head classification requires checking diagonal score before entropy. A local head, one that only attends to the token immediately to its left, can have very low entropy because its distribution is also sharply peaked. Checking entropy first mislabels it as a copy head. The diagonal check catches it correctly as a positional head. This ordering matters if you are building any automated attention analysis tooling.</p>
<hr />
<h2>Where It Runs in 2026</h2>
<p><strong>Where it runs:</strong> Every transformer-family model in production. LLaMA 2/3, Mistral, Gemma, Falcon, GPT-4, Claude: all implement direct descendants of the mechanism in this paper. PyTorch's <code>nn.MultiheadAttention</code>, every Hugging Face model's attention module, and every CUDA kernel in vLLM, TGI, and TensorRT-LLM trace back to Section 3.2.</p>
<p><strong>What's changed since the paper:</strong></p>
<p>The Table 3 configuration (d_model=512, 8 heads, 6 layers, sinusoidal positional encoding) is a toy by current standards. The biggest change is positional encoding. Sinusoidal fixed encodings, as used in the original paper, were replaced by Rotary Position Embeddings (RoPE) in LLaMA and Mistral and by ALiBi in MPT. RoPE applies a rotation matrix to Q and K inside the attention operation rather than adding position to the input embedding. This gives better length generalisation and sharper relative distance signal, which is why it is the default choice for every model targeting 128K+ context windows.</p>
<p>The other structural change with comparable impact is the shift from encoder-decoder to decoder-only. GPT, LLaMA, and Mistral drop the encoder entirely. Cross-attention between encoder and decoder is replaced by in-context conditioning through the causal attention mask. The masked decoder self-attention from Section 3.2.3 is the only attention mechanism in every autoregressive LLM today. The encoder half of the original paper is now primarily relevant for embedding models like BERT and its descendants.</p>
<p>Pre-norm (LayerNorm before each sub-layer) replaced post-norm because post-norm causes gradient explosions at 70B+ scale. Grouped Query Attention (GQA), used in LLaMA 2/3 70B, shares 8 K/V heads across 64 query heads, cutting KV cache by 8x with under 1% accuracy loss.</p>
<p><strong>The production gotcha:</strong> KV cache memory at long contexts exceeds model weights. LLaMA-3-70B has approximately 140GB of weights in float16. With GQA, each token in the KV cache costs: 80 layers × 8 KV heads × 128 d_k × 2 (K and V) × 2 bytes = ~327KB per token. At a 128K context, that is ~43GB of KV cache. Without GQA (full multi-head, 64 KV heads), it would be ~343GB, more than double the model weights. Batching 10 concurrent users at 128K context without GQA would require ~3.4TB of KV cache. This is why GQA, Multi-Query Attention, and KV cache quantization exist in every production serving stack: the scaling law for the attention mechanism bites in production before it bites in benchmarks.</p>
<hr />
<h2>When to Use It, When Not To</h2>
<p><strong>USE WHEN:</strong></p>
<ul>
<li><p>Input length is &lt; 8K tokens and you need all-to-all relationships: standard fine-tuning on classification, summarisation, or translation tasks</p>
</li>
<li><p>Training on a multi-GPU cluster where the parallelization benefit of attention over RNNs is the primary constraint</p>
</li>
<li><p>The task requires resolving long-range dependencies (coreference, discourse coherence) that CNNs cannot reach in a single pass</p>
</li>
<li><p>You want interpretable attention patterns for debugging or analysis</p>
</li>
</ul>
<p><strong>DON'T USE WHEN:</strong></p>
<ul>
<li><p>Sequence length regularly exceeds 8K tokens in production without Flash Attention: the O(n²) score matrix makes large batches impossible at naive float32</p>
</li>
<li><p>You're targeting sub-100MB VRAM inference: at n=1024 the score matrix alone is 671MB per layer</p>
</li>
<li><p>You need O(1) memory per token for streaming inference: attention requires the full KV context (KV cache grows linearly with sequence length)</p>
</li>
</ul>
<p><strong>USE ALTERNATIVE INSTEAD:</strong></p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Alternative</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Context &gt; 32K tokens</td>
<td>Flash Attention 2/3</td>
<td>Rewrites attention kernel to avoid materializing the O(n²) score matrix; same mathematical output, O(n) peak VRAM</td>
</tr>
<tr>
<td>Many concurrent users at long context</td>
<td>Grouped Query Attention (GQA)</td>
<td>Shares K/V heads across query heads; reduces KV cache 4–8× with &lt;1% accuracy loss</td>
</tr>
<tr>
<td>Inference VRAM &lt; 4GB</td>
<td>State Space Models (Mamba)</td>
<td>O(n) memory via selective recurrence; no KV cache at all; competitive on many tasks</td>
</tr>
<tr>
<td>Document retrieval, not generation</td>
<td>Late interaction (ColBERT)</td>
<td>Per-token MaxSim scoring instead of full attention; retrieves better at lower compute</td>
</tr>
</tbody></table>
<p>The core trade-off is exact: attention provides a one-hop connection between any two tokens at O(n²) space. Every production modification either approximates that connection (GQA, local attention windows) or rewrites the computation graph to avoid materializing it (Flash Attention). The paper introduced the mechanism. Nine years of engineering work have been spent making it practical at scale.</p>
<hr />
<h2>The Code</h2>
<p>Both artifacts are in <a href="https://github.com/yash-61016/attention-is-all-you-need"><code>attention-is-all-you-need/</code></a>.</p>
<p><strong>Artifact 1</strong> (<code>attention_from_scratch.py</code>): Implements scaled dot-product attention and multi-head attention in 200 lines of pure PyTorch with explicit QKV projections, verified against <code>F.scaled_dot_product_attention</code> (max diff 1.19e-07). Includes a VRAM scaling experiment across five sequence lengths demonstrating quadratic growth. Runs in ~2 minutes on an RTX 3090 or CPU.</p>
<p><strong>Artifact 2</strong> (<code>attention_head_analysis.py</code>): Loads GPT-2 small (117M parameters) and runs 100 varied English sentences through all 12 layers × 12 heads. Measures per-head Shannon entropy and diagonal locality score, classifies all 144 heads into four empirical types, and produces three charts: the 12×12 entropy heatmap, the layer-depth gradient showing specialisation increasing with depth, and one representative attention matrix per head type.</p>
<p>Hardware: RTX 3090 (24GB VRAM). Dependencies: PyTorch 2.1.0, Transformers 4.38.2, Matplotlib 3.7.1.</p>
]]></content:encoded></item><item><title><![CDATA[Beyond try/except: Architecting Robust Error Handling in Python Applications]]></title><description><![CDATA[As Python developers gain experience, the simple try...except block, while essential, often reveals its limitations in larger, more complex applications. We move from merely catching errors to needing a coherent strategy for managing them – one that ...]]></description><link>https://blogs.yashpatel.xyz/beyond-tryexcept-architecting-robust-error-handling-in-python-applications</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/beyond-tryexcept-architecting-robust-error-handling-in-python-applications</guid><category><![CDATA[Testing]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:56:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745391218296/5cf168d5-852a-44df-9d65-99aa827af69d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As Python developers gain experience, the simple <code>try...except</code> block, while essential, often reveals its limitations in larger, more complex applications. We move from merely catching errors to needing a coherent strategy for managing them – one that enhances readability, simplifies debugging, and improves overall system resilience. Basic <code>try/except</code> blocks can quickly lead to tangled logic, obscured error origins, and difficulty in distinguishing between expected hiccups and genuine catastrophes.</p>
<p>This post delves into architectural patterns and advanced techniques for error handling in Python, targeting engineers looking to build more maintainable and resilient systems. We'll explore how thoughtful design, custom exceptions, alternative signalling patterns, strategic logging, and dedicated testing can transform error handling from a reactive chore into a proactive element of robust application architecture. Let's move beyond basic error catching and architect applications that don't just crash gracefully but handle failures intelligently.</p>
<h2 id="heading-introduction-the-limits-of-basic-exception-handling"><strong>Introduction: The Limits of Basic Exception Handling</strong></h2>
<p>The standard <code>try...except</code> mechanism is Python's cornerstone for handling runtime errors. It allows us to gracefully recover from unexpected situations. However, relying solely on generic <code>except Exception:</code> or scattering <code>try...except</code> blocks liberally throughout a codebase often leads to problems:</p>
<ol>
<li><p><strong>Loss of Specificity:</strong> Catching broad exceptions makes it hard to know <em>what</em> actually went wrong and react appropriately.</p>
</li>
<li><p><strong>Obscured Control Flow:</strong> Deeply nested <code>try...except</code> blocks can make code difficult to follow and reason about.</p>
</li>
<li><p><strong>Mixing Error Logic with Business Logic:</strong> Interspersing error handling frequently complicates the core functions or methods.</p>
</li>
<li><p><strong>Inconsistent Handling:</strong> Different parts of the application might handle similar errors in wildly different ways.</p>
</li>
</ol>
<p>To build scalable and maintainable applications, we need to elevate our error handling strategy.</p>
<h2 id="heading-designing-custom-exception-hierarchies-for-clarity"><strong>Designing Custom Exception Hierarchies for Clarity</strong></h2>
<p>Python's built-in exceptions (<code>ValueError</code>, <code>TypeError</code>, <code>FileNotFoundError</code>, etc.) are great, but they often lack application-specific context. Defining your own exception hierarchy provides semantic meaning and allows for more granular error handling.</p>
<p><strong>Why Create Custom Exceptions?</strong></p>
<ul>
<li><p><strong>Clarity:</strong> <code>UserServiceError</code> tells you much more than a generic <code>ValueError</code>.</p>
</li>
<li><p><strong>Targeted Handling:</strong> You can catch specific application-level errors (<code>except UserNotFoundError:</code>) separate from lower-level issues (<code>except DatabaseConnectionError:</code>).</p>
</li>
<li><p><strong>Encapsulation:</strong> Custom exceptions can carry additional context about the error (e.g., relevant IDs, failed parameters).</p>
</li>
</ul>
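<p>Targeted handling in practice looks like this. A minimal, self-contained sketch: the <code>create_user</code> service function is hypothetical, and the classes are trimmed-down stand-ins mirroring the hierarchy defined below:</p>

```python
class ApplicationError(Exception):
    """Trimmed-down base class standing in for the full hierarchy below."""
    def __init__(self, message="An application error occurred.", context=None):
        super().__init__(message)
        self.context = context or {}

class ValidationError(ApplicationError):
    def __init__(self, message="Validation failed.", field=None, context=None):
        super().__init__(message, context=context)
        if field:
            self.context["field"] = field

def create_user(payload):
    # hypothetical service-layer function raising semantic, app-level errors
    if "email" not in payload:
        raise ValidationError("Missing required field.", field="email")
    return {"id": 1, **payload}

try:
    create_user({})
except ValidationError as e:       # targeted: the caller knows exactly what failed
    outcome = f"validation error on field {e.context['field']}"
except ApplicationError:           # broad fallback for other app-specific errors
    outcome = "generic application failure"

print(outcome)  # validation error on field email
```

Ordering matters: the specific <code>ValidationError</code> handler must come before the broad <code>ApplicationError</code> one, or it will never fire.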
<p><strong>Designing the Hierarchy:</strong></p>
<p>Start with a base application exception and derive specific errors from it.</p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># --- exceptions.py ---</span>
<span class="hljs-keyword">import</span> logging

logger = logging.getLogger(__name__)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ApplicationError</span>(<span class="hljs-params">Exception</span>):</span>
    <span class="hljs-string">"""Base class for application-specific errors."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"An application error occurred."</span>, original_exception=None, context=None</span>):</span>
        super().__init__(message)
        self.original_exception = original_exception
        self.context = context <span class="hljs-keyword">or</span> {}
        <span class="hljs-comment"># Log the error creation centrally if desired (can be noisy)</span>
        <span class="hljs-comment"># logger.error(f"{self.__class__.__name__}: {message}", exc_info=original_exception)</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DatabaseError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to database operations."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"Database operation failed."</span>, original_exception=None, context=None</span>):</span>
        super().__init__(message, original_exception, context)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ValidationError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to data validation."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"Validation failed."</span>, field=None, value=None, context=None</span>):</span>
        super().__init__(message, context=context)
        self.field = field
        self.value = value
        <span class="hljs-keyword">if</span> field:
            self.context[<span class="hljs-string">'field'</span>] = field
        <span class="hljs-keyword">if</span> value <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            self.context[<span class="hljs-string">'invalid_value'</span>] = value

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AuthenticationError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors related to user authentication or authorization."""</span>
    <span class="hljs-keyword">pass</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ExternalServiceError</span>(<span class="hljs-params">ApplicationError</span>):</span>
    <span class="hljs-string">"""Errors when communicating with external services."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, message=<span class="hljs-string">"External service communication failed."</span>, service_name=None, original_exception=None, context=None</span>):</span>
        super().__init__(message, original_exception, context)
        self.service_name = service_name
        <span class="hljs-keyword">if</span> service_name:
            self.context[<span class="hljs-string">'service_name'</span>] = service_name

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-comment"># In your data access layer:</span>
<span class="hljs-comment"># try:</span>
<span class="hljs-comment">#     # db_operation(...)</span>
<span class="hljs-comment"># except SomeDbLibraryError as e:</span>
<span class="hljs-comment">#     raise DatabaseError(original_exception=e, context={'query': 'SELECT...'})</span>

<span class="hljs-comment"># In your input validation logic:</span>
<span class="hljs-comment"># if not is_valid_email(email):</span>
<span class="hljs-comment">#     raise ValidationError(field='email', value=email)</span>

<span class="hljs-comment"># In higher-level code:</span>
<span class="hljs-comment"># try:</span>
<span class="hljs-comment">#     process_user_request(data)</span>
<span class="hljs-comment"># except ValidationError as ve:</span>
<span class="hljs-comment">#     return api_error_response(f"Invalid input for {ve.field}", status_code=400)</span>
<span class="hljs-comment"># except DatabaseError as de:</span>
<span class="hljs-comment">#     logger.exception("Critical database error during user request.") # Log full trace</span>
<span class="hljs-comment">#     return api_error_response("Internal server error", status_code=500)</span>
<span class="hljs-comment"># except ApplicationError as ae:</span>
<span class="hljs-comment">#     logger.warning(f"Application error: {ae}", exc_info=ae.original_exception)</span>
<span class="hljs-comment">#     return api_error_response("An unexpected error occurred.", status_code=500)</span>
</code></pre>
<p>By catching <code>ApplicationError</code>, you can handle all your custom errors, while still allowing specific handling for <code>ValidationError</code> or <code>DatabaseError</code> where needed.</p>
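<p>To see the hierarchy pay off in isolation, here is a minimal, self-contained sketch (the classes are repeated in trimmed form, and <code>bad_input</code>/<code>quota_exceeded</code> are invented stand-ins): a clause written against the base class transparently catches any subclass, while a more specific clause can still intercept first.</p>

```python
# Trimmed copies of the hierarchy, repeated so this snippet runs on its own.
class ApplicationError(Exception):
    """Base class for application-specific errors."""
    def __init__(self, message="An application error occurred.", context=None):
        super().__init__(message)
        self.message = message
        self.context = context or {}

class ValidationError(ApplicationError):
    """Errors related to data validation."""
    def __init__(self, message="Validation failed.", field=None):
        super().__init__(message)
        self.field = field

def handle(func):
    """Specific clauses first, base-class clause as the catch-all."""
    try:
        func()
        return "ok"
    except ValidationError as e:      # narrowest match wins
        return f"validation: {e.field}"
    except ApplicationError as e:     # catches every other subclass
        return f"application: {e.message}"

def bad_input():
    raise ValidationError(field="email")

def quota_exceeded():
    raise ApplicationError("Quota exceeded")

print(handle(bad_input))        # validation: email
print(handle(quota_exceeded))   # application: Quota exceeded
```

<p>Note that clause order matters: if the <code>ApplicationError</code> clause came first, it would shadow the <code>ValidationError</code> one entirely.</p>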
<h2 id="heading-pattern-the-result-object-for-explicit-error-signalling"><strong>Pattern: The Result Object for Explicit Error Signalling</strong></h2>
<p>Not all "failures" are exceptional. Sometimes, an operation is expected to fail under certain conditions (e.g., user not found, validation rule violation). Raising exceptions for these expected outcomes can be cumbersome and, when frequent, measurably costly. The <strong>Result</strong> pattern (inspired by functional programming concepts like Monads, particularly <code>Either</code> or <code>Result</code>) offers an alternative.</p>
<p>Instead of raising an exception, a function returns an object that explicitly represents either success (containing the value) or failure (containing error details).</p>
<p><strong>Benefits:</strong></p>
<ul>
<li><p><strong>Explicitness:</strong> The function signature makes it clear that it can fail in a controlled way.</p>
</li>
<li><p><strong>Clean Control Flow:</strong> Reduces the need for <code>try/except</code> blocks for <em>expected</em> failures.</p>
</li>
<li><p><strong>Type Safety (with type hints):</strong> Helps ensure callers handle both success and failure cases.</p>
</li>
</ul>
<p><strong>Simple Implementation:</strong></p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># --- result.py ---</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> TypeVar, Generic, Union, Optional, Any

T = TypeVar(<span class="hljs-string">'T'</span>) <span class="hljs-comment"># Success type</span>
E = TypeVar(<span class="hljs-string">'E'</span>) <span class="hljs-comment"># Error type</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Success</span>(<span class="hljs-params">Generic[T]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, value: T</span>):</span>
        self._value = value

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_success</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_failure</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_value</span>(<span class="hljs-params">self</span>) -&gt; T:</span>
        <span class="hljs-keyword">return</span> self._value

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_error</span>(<span class="hljs-params">self</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Cannot get error from Success"</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Failure</span>(<span class="hljs-params">Generic[E]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, error: E</span>):</span>
        self._error = error

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_success</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_failure</span>(<span class="hljs-params">self</span>) -&gt; bool:</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_value</span>(<span class="hljs-params">self</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Cannot get value from Failure"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_error</span>(<span class="hljs-params">self</span>) -&gt; E:</span>
        <span class="hljs-keyword">return</span> self._error

Result = Union[Success[T], Failure[E]]

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Dict, Any

<span class="hljs-comment"># Assume ValidationError is defined as in the previous section</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_user_data</span>(<span class="hljs-params">data: Dict[str, Any]</span>) -&gt; Result[Dict[str, Any], ValidationError]:</span>
    email = data.get(<span class="hljs-string">'email'</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> email <span class="hljs-keyword">or</span> <span class="hljs-string">'@'</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> email:
        <span class="hljs-comment"># Return Failure for an *expected* validation issue</span>
        <span class="hljs-keyword">return</span> Failure(ValidationError(field=<span class="hljs-string">'email'</span>, value=email, message=<span class="hljs-string">"Invalid email format"</span>))

    <span class="hljs-comment"># ... other validations ...</span>

    <span class="hljs-comment"># Return Success if valid</span>
    <span class="hljs-keyword">return</span> Success(data) <span class="hljs-comment"># Or perhaps return a validated User object</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_registration</span>(<span class="hljs-params">data: Dict[str, Any]</span>):</span>
    validation_result = validate_user_data(data)

    <span class="hljs-keyword">if</span> validation_result.is_failure():
        error = validation_result.get_error()
        print(<span class="hljs-string">f"Registration failed validation: <span class="hljs-subst">{error}</span> (Field: <span class="hljs-subst">{error.field}</span>)"</span>)
        <span class="hljs-comment"># Return an appropriate response or re-raise if truly exceptional here</span>
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"error"</span>, <span class="hljs-string">"message"</span>: <span class="hljs-string">f"Validation failed: <span class="hljs-subst">{error.message}</span>"</span>}

    <span class="hljs-comment"># If success, proceed with validated data</span>
    validated_data = validation_result.get_value()
    print(<span class="hljs-string">"Validation successful, proceeding with registration..."</span>)
    <span class="hljs-comment"># ... create user in database (this might raise a DatabaseError exception) ...</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"success"</span>, <span class="hljs-string">"user_id"</span>: <span class="hljs-number">123</span>}

<span class="hljs-comment"># Calling the function</span>
user_data_invalid = {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Test"</span>}
process_registration(user_data_invalid)

user_data_valid = {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Test"</span>, <span class="hljs-string">"email"</span>: <span class="hljs-string">"test@example.com"</span>}
process_registration(user_data_valid)
</code></pre>
<p>Libraries like <code>returns</code> or <code>result</code> provide more sophisticated implementations of this pattern. Use Results for predictable failure paths and Exceptions for truly unexpected or system-level errors.</p>
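<p>One convenience the minimal implementation above leaves out is chaining: running several fallible steps without nested <code>if</code> checks between them. A small <code>and_then</code> helper (a hand-rolled sketch, not an API of any library; <code>Success</code>/<code>Failure</code> are repeated in simplified form so the snippet is self-contained) short-circuits on the first <code>Failure</code>:</p>

```python
from typing import Generic, TypeVar, Union

T = TypeVar('T')
E = TypeVar('E')

class Success(Generic[T]):
    def __init__(self, value: T):
        self.value = value
    def is_success(self) -> bool:
        return True

class Failure(Generic[E]):
    def __init__(self, error: E):
        self.error = error
    def is_success(self) -> bool:
        return False

Result = Union[Success[T], Failure[E]]

def and_then(result, fn):
    """Feed a Success value into the next step; pass a Failure through untouched."""
    return fn(result.value) if result.is_success() else result

def parse_int(s: str):
    try:
        return Success(int(s))
    except ValueError:
        return Failure(f"not an integer: {s!r}")

def check_positive(n: int):
    return Success(n) if n > 0 else Failure(f"not positive: {n}")

r1 = and_then(parse_int("42"), check_positive)   # Success(42)
r2 = and_then(parse_int("-3"), check_positive)   # Failure: not positive
r3 = and_then(parse_int("abc"), check_positive)  # Failure: parse step already failed
```

<p>Because <code>and_then</code> returns the <code>Failure</code> unchanged, later steps never run once one step fails — the same control flow that early returns give the exception style, expressed as data.</p>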
<h2 id="heading-centralized-vs-localized-error-handling-making-the-right-choice"><strong>Centralized vs. Localized Error Handling: Making the Right Choice</strong></h2>
<p>Where should you handle errors?</p>
<ul>
<li><p><strong>Localized Handling:</strong> Catch and handle errors immediately where they occur.</p>
<ul>
<li><p><strong>Pros:</strong> Simple for straightforward cases, keeps handling logic close to the source.</p>
</li>
<li><p><strong>Cons:</strong> Can lead to repetition, mixes error logic with business logic, difficult to enforce consistency.</p>
</li>
<li><p><strong>Best for:</strong> Recoverable errors where the immediate context has all the information needed to proceed or compensate (e.g., retrying a network call, falling back to a default value).</p>
</li>
</ul>
</li>
<li><p><strong>Centralized Handling:</strong> Allow exceptions to propagate up the call stack and handle them at specific boundaries (e.g., API endpoint decorators, middleware, main application loop).</p>
<ul>
<li><p><strong>Pros:</strong> Enforces consistency (logging, error responses), separates concerns, simplifies core business logic.</p>
</li>
<li><p><strong>Cons:</strong> Can sometimes lose specific context if not propagated correctly (custom exceptions help here!), might require more setup (e.g., framework middleware).</p>
</li>
<li><p><strong>Best for:</strong> Handling errors that terminate a request/response cycle (web apps), performing consistent logging/alerting, converting exceptions to user-friendly messages or standard error formats (e.g., JSON API error responses).</p>
</li>
</ul>
</li>
</ul>
<p><strong>Example (Web Framework Middleware/Decorator):</strong></p>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Using Flask as an example (similar concepts apply to Django, FastAPI, etc.)</span>
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, jsonify, request
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> ApplicationError, ValidationError, DatabaseError <span class="hljs-comment"># Our custom exceptions</span>
<span class="hljs-keyword">import</span> logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

<span class="hljs-meta">@app.errorhandler(ValidationError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_validation_error</span>(<span class="hljs-params">error: ValidationError</span>):</span>
    logging.warning(<span class="hljs-string">f"Validation failed: <span class="hljs-subst">{error.message}</span> for field '<span class="hljs-subst">{error.field}</span>'"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"VALIDATION_ERROR"</span>,
        <span class="hljs-string">"message"</span>: error.message,
        <span class="hljs-string">"details"</span>: error.context
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">400</span> <span class="hljs-comment"># Bad Request</span>

<span class="hljs-meta">@app.errorhandler(DatabaseError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_database_error</span>(<span class="hljs-params">error: DatabaseError</span>):</span>
    <span class="hljs-comment"># Log the full exception trace for internal errors</span>
    logging.exception(<span class="hljs-string">f"Database error occurred processing <span class="hljs-subst">{request.path}</span>"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"INTERNAL_SERVER_ERROR"</span>,
        <span class="hljs-string">"message"</span>: <span class="hljs-string">"A database error occurred. Please try again later."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span> <span class="hljs-comment"># Internal Server Error</span>

<span class="hljs-meta">@app.errorhandler(ApplicationError)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_application_error</span>(<span class="hljs-params">error: ApplicationError</span>):</span>
    <span class="hljs-comment"># Catch-all for other specific application errors</span>
    logging.error(<span class="hljs-string">f"Application error occurred: <span class="hljs-subst">{error}</span>"</span>, exc_info=error.original_exception)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"APPLICATION_ERROR"</span>,
        <span class="hljs-string">"message"</span>: str(error) <span class="hljs-keyword">or</span> <span class="hljs-string">"An unexpected application error occurred."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span> <span class="hljs-comment"># Internal Server Error</span>

<span class="hljs-meta">@app.errorhandler(Exception)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_generic_exception</span>(<span class="hljs-params">error: Exception</span>):</span>
    <span class="hljs-comment"># Catch any unexpected Python errors not caught by specific handlers</span>
    logging.exception(<span class="hljs-string">f"Unhandled exception occurred processing <span class="hljs-subst">{request.path}</span>"</span>)
    response = {
        <span class="hljs-string">"error"</span>: <span class="hljs-string">"UNEXPECTED_ERROR"</span>,
        <span class="hljs-string">"message"</span>: <span class="hljs-string">"An unexpected server error occurred."</span>
    }
    <span class="hljs-keyword">return</span> jsonify(response), <span class="hljs-number">500</span>

<span class="hljs-meta">@app.route('/users', methods=['POST'])</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_user</span>():</span>
    <span class="hljs-comment"># ... get data from request ...</span>
    data = request.json
    <span class="hljs-comment"># Assume process_registration raises ValidationError or DatabaseError on failure</span>
    <span class="hljs-comment"># It doesn't need try/except blocks internally for these specific errors anymore</span>
    <span class="hljs-comment"># because the error handlers above will catch them.</span>
    <span class="hljs-comment"># result = process_registration(data) # Function might still raise exceptions</span>
    <span class="hljs-comment"># Mocking potential errors for demonstration</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> data.get(<span class="hljs-string">'email'</span>):
         <span class="hljs-keyword">raise</span> ValidationError(field=<span class="hljs-string">'email'</span>, message=<span class="hljs-string">'Email is required'</span>)
    <span class="hljs-keyword">if</span> data.get(<span class="hljs-string">'trigger_db_error'</span>):
         <span class="hljs-keyword">raise</span> DatabaseError(<span class="hljs-string">"Failed to connect to user DB"</span>)

    <span class="hljs-keyword">return</span> jsonify({<span class="hljs-string">"status"</span>: <span class="hljs-string">"success"</span>, <span class="hljs-string">"user_id"</span>: <span class="hljs-number">42</span>}), <span class="hljs-number">201</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    app.run(debug=<span class="hljs-literal">True</span>) <span class="hljs-comment"># Debug mode shows interactive traceback; test handlers with debug=False</span>
</code></pre>
<p>Often, a combination is best: handle specific, recoverable errors locally, and let broader or request-terminating errors propagate to a centralized handler.</p>
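<p>A concrete shape for that combination: handle a transient failure locally with retries, and only when recovery fails wrap the original exception and let it propagate to the centralized handler. This is a sketch — the retry count, <code>flaky_fetch</code>, and the use of <code>ExternalServiceError</code> are illustrative, and the class is repeated in trimmed form so the snippet runs on its own:</p>

```python
import time

class ApplicationError(Exception):
    """Trimmed copy of the base class from earlier."""
    def __init__(self, message, original_exception=None):
        super().__init__(message)
        self.original_exception = original_exception

class ExternalServiceError(ApplicationError):
    """Errors when communicating with external services."""

def fetch_with_retry(fetch, attempts=3, delay=0.0):
    """Localized handling: retry transient ConnectionErrors in place.
    Once retries are exhausted, escalate for centralized handling."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fetch()
        except ConnectionError as e:  # recoverable here: just try again
            last_exc = e
            time.sleep(delay)
    # Recovery failed: wrap the cause and propagate upward
    raise ExternalServiceError("Upstream service unavailable",
                               original_exception=last_exc)

# Demo: a fetch that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return {"status": "ok"}

print(fetch_with_retry(flaky_fetch))  # recovered locally; the caller never sees the timeouts
```

<p>The caller's code stays clean: transient blips are absorbed where the retry knowledge lives, while a genuinely dead upstream becomes a typed exception the centralized handler already knows how to log and report.</p>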
<h2 id="heading-effective-logging-strategies-tied-to-exceptions"><strong>Effective Logging Strategies Tied to Exceptions</strong></h2>
<p>Logging is crucial for understanding errors after they occur. Integrate it tightly with your exception handling strategy.</p>
<ul>
<li><p><strong>Log at the Right Level:</strong></p>
<ul>
<li><p><code>logging.ERROR</code> or <code>logging.CRITICAL</code>: For unhandled exceptions or serious failures caught in centralized handlers. Include stack traces (<code>exc_info=True</code>).</p>
</li>
<li><p><code>logging.WARNING</code>: For handled exceptions that indicate a potential problem or an expected but notable failure (e.g., validation errors, external service timeouts if handled gracefully).</p>
</li>
<li><p><code>logging.INFO</code>: For significant application lifecycle events, not typically for errors themselves.</p>
</li>
<li><p><code>logging.DEBUG</code>: For detailed information useful only during development/debugging.</p>
</li>
</ul>
</li>
<li><p><strong>Include Context:</strong> Log relevant information like user IDs, request IDs, input parameters (be careful with sensitive data!), and custom exception attributes. Centralized handlers and custom exception <code>context</code> attributes are great for this.</p>
</li>
<li><p><strong>Use Structured Logging:</strong> Log messages in formats like JSON. This makes logs much easier to parse, filter, and analyze with log aggregation tools (e.g., ELK stack, Datadog, Splunk).</p>
</li>
</ul>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Configure structured logging (basic example)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">JsonFormatter</span>(<span class="hljs-params">logging.Formatter</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">format</span>(<span class="hljs-params">self, record</span>):</span>
        log_record = {
            <span class="hljs-string">"timestamp"</span>: self.formatTime(record, self.datefmt),
            <span class="hljs-string">"level"</span>: record.levelname,
            <span class="hljs-string">"message"</span>: record.getMessage(),
            <span class="hljs-string">"logger_name"</span>: record.name,
        }
        <span class="hljs-keyword">if</span> record.exc_info:
            <span class="hljs-comment"># Add traceback if available</span>
            log_record[<span class="hljs-string">'exception'</span>] = self.formatException(record.exc_info)
        <span class="hljs-keyword">if</span> hasattr(record, <span class="hljs-string">'context'</span>): <span class="hljs-comment"># Add custom context if provided</span>
             log_record.update(record.context)
        <span class="hljs-keyword">return</span> json.dumps(log_record)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger(__name__)

<span class="hljs-comment"># --- Example Usage ---</span>
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> DatabaseError <span class="hljs-comment"># Our custom exception</span>

request_id = <span class="hljs-string">"req-123abc"</span>
user_id = <span class="hljs-number">42</span>

<span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Simulate a database operation failing</span>
    <span class="hljs-keyword">raise</span> ConnectionError(<span class="hljs-string">"DB connection timeout"</span>)
<span class="hljs-keyword">except</span> ConnectionError <span class="hljs-keyword">as</span> e:
    <span class="hljs-comment"># Wrap the original exception and add context</span>
    db_error = DatabaseError(
        original_exception=e,
        context={<span class="hljs-string">'query_details'</span>: <span class="hljs-string">'UPDATE users SET ...'</span>, <span class="hljs-string">'user_id'</span>: user_id, <span class="hljs-string">'request_id'</span>: request_id}
    )
    <span class="hljs-comment"># Log with ERROR level, exc_info, and custom context</span>
    logger.error(
        <span class="hljs-string">f"Database operation failed for user <span class="hljs-subst">{user_id}</span>"</span>,
        exc_info=db_error,
        extra={<span class="hljs-string">'context'</span>: db_error.context} <span class="hljs-comment"># Pass context to logger</span>
    )
    <span class="hljs-comment"># Re-raise or handle as appropriate</span>
    <span class="hljs-comment"># raise db_error</span>
</code></pre>
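<p>For request-scoped context there is also a lighter stdlib option: <code>logging.LoggerAdapter</code> wraps a logger with a fixed <code>extra</code> dict, so every line emitted through it carries the same fields without repeating them at each call site. A minimal sketch (the request id and logger name are made up; note the format string assumes every record passing through this handler has a <code>request_id</code>):</p>

```python
import io
import logging

# Capture output in a string so the result is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))

logger = logging.getLogger("app.request")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One adapter per request: the id rides along on every record automatically
log = logging.LoggerAdapter(logger, {"request_id": "req-123abc"})
log.warning("validation failed for field 'email'")

print(stream.getvalue().strip())  # WARNING req-123abc validation failed for field 'email'
```

<p>In a web app you would typically build the adapter once per request (e.g., in middleware) and hand it to downstream code in place of the bare logger.</p>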
<h2 id="heading-testing-your-error-paths-ensuring-resilience"><strong>Testing Your Error Paths: Ensuring Resilience</strong></h2>
<p>Your error handling code is code too, and it needs testing!</p>
<ul>
<li><p><strong>Test Exception Raising:</strong> Use <code>pytest.raises</code> to assert that specific functions raise the expected custom exceptions under failure conditions.</p>
</li>
<li><p><strong>Test Exception Handling:</strong> In integration or end-to-end tests, simulate failure conditions (e.g., mock a database call to raise an error) and verify that your centralized handlers produce the correct output (e.g., the right HTTP status code and error JSON).</p>
</li>
<li><p><strong>Test Result Objects:</strong> If using the Result pattern, write tests that check both the <code>Success</code> and <code>Failure</code> return paths, verifying the contained value or error object.</p>
</li>
</ul>
<p><strong>Python</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pytest
<span class="hljs-keyword">from</span> exceptions <span class="hljs-keyword">import</span> ValidationError, DatabaseError
<span class="hljs-keyword">from</span> result <span class="hljs-keyword">import</span> Success, Failure, Result <span class="hljs-comment"># Assuming result.py from earlier</span>

<span class="hljs-comment"># --- Functions to test (simplified examples) ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_email</span>(<span class="hljs-params">email: str | None</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> email <span class="hljs-keyword">or</span> <span class="hljs-string">'@'</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> email:
        <span class="hljs-keyword">raise</span> ValidationError(field=<span class="hljs-string">'email'</span>, value=email)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">might_fail_db</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-comment"># Simulate a potential failure</span>
    <span class="hljs-keyword">raise</span> DatabaseError(<span class="hljs-string">"Connection failed"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data_result</span>(<span class="hljs-params">data: dict</span>) -&gt; Result[str, str]:</span>
     <span class="hljs-keyword">if</span> data.get(<span class="hljs-string">"valid"</span>):
         <span class="hljs-keyword">return</span> Success(<span class="hljs-string">"Processed successfully"</span>)
     <span class="hljs-keyword">else</span>:
         <span class="hljs-keyword">return</span> Failure(<span class="hljs-string">"Invalid data provided"</span>)

<span class="hljs-comment"># --- Tests using pytest ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_validate_email_raises_validation_error</span>():</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValidationError) <span class="hljs-keyword">as</span> excinfo:
        validate_email(<span class="hljs-string">"invalid-email"</span>)
    <span class="hljs-keyword">assert</span> excinfo.value.field == <span class="hljs-string">'email'</span>
    <span class="hljs-keyword">assert</span> excinfo.value.value == <span class="hljs-string">"invalid-email"</span>

    <span class="hljs-keyword">with</span> pytest.raises(ValidationError):
        validate_email(<span class="hljs-literal">None</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_validate_email_success</span>():</span>
    <span class="hljs-keyword">try</span>:
        validate_email(<span class="hljs-string">"test@example.com"</span>)
    <span class="hljs-keyword">except</span> ValidationError:
        pytest.fail(<span class="hljs-string">"ValidationError raised unexpectedly"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_db_function_raises_database_error</span>():</span>
    <span class="hljs-keyword">with</span> pytest.raises(DatabaseError):
        might_fail_db()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_process_data_result_success</span>():</span>
    result = process_data_result({<span class="hljs-string">"valid"</span>: <span class="hljs-literal">True</span>})
    <span class="hljs-keyword">assert</span> result.is_success()
    <span class="hljs-keyword">assert</span> <span class="hljs-keyword">not</span> result.is_failure()
    <span class="hljs-keyword">assert</span> result.get_value() == <span class="hljs-string">"Processed successfully"</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValueError): <span class="hljs-comment"># Cannot get error from Success</span>
        result.get_error()


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_process_data_result_failure</span>():</span>
    result = process_data_result({<span class="hljs-string">"valid"</span>: <span class="hljs-literal">False</span>})
    <span class="hljs-keyword">assert</span> <span class="hljs-keyword">not</span> result.is_success()
    <span class="hljs-keyword">assert</span> result.is_failure()
    <span class="hljs-keyword">assert</span> result.get_error() == <span class="hljs-string">"Invalid data provided"</span>
    <span class="hljs-keyword">with</span> pytest.raises(ValueError): <span class="hljs-comment"># Cannot get value from Failure</span>
        result.get_value()

<span class="hljs-comment"># For testing centralized handlers (e.g., Flask app):</span>
<span class="hljs-comment"># Use the test client provided by the framework</span>
<span class="hljs-comment"># def test_api_validation_error(client): # client is a pytest fixture for the Flask test client</span>
<span class="hljs-comment">#     response = client.post('/users', json={'name': 'Test'}) # Missing email</span>
<span class="hljs-comment">#     assert response.status_code == 400</span>
<span class="hljs-comment">#     assert response.json['error'] == 'VALIDATION_ERROR'</span>
<span class="hljs-comment">#     assert 'email' in response.json['details'].get('field', '')</span>
</code></pre>
<h2 id="heading-conclusion-error-handling-as-a-core-architectural-concern"><strong>Conclusion: Error Handling as a Core Architectural Concern</strong></h2>
<p>Robust error handling is not an afterthought; it's a cornerstone of reliable, maintainable software. By moving beyond basic <code>try/except</code> blocks and adopting more structured approaches, we can significantly improve our Python applications.</p>
<ul>
<li><p><strong>Design custom exception hierarchies</strong> for semantic clarity and targeted handling.</p>
</li>
<li><p><strong>Consider the Result pattern</strong> for explicit signaling of expected, non-exceptional failures.</p>
</li>
<li><p><strong>Strategically choose between localized and centralized handling</strong> to balance simplicity and consistency.</p>
</li>
<li><p><strong>Integrate logging deeply</strong> with your error handling, providing context and structure.</p>
</li>
<li><p><strong>Rigorously test your error paths</strong> to ensure your safety nets actually work.</p>
</li>
</ul>
<p>By incorporating these techniques, you shift from simply reacting to errors to proactively architecting for resilience. This investment pays dividends in easier debugging, more stable applications, and a more pleasant development experience for you and your team.</p>
]]></content:encoded></item><item><title><![CDATA[Demystifying Reinforcement Learning: A Beginner's Guide to the Math]]></title><description><![CDATA[Introduction
Imagine teaching a computer to play chess from scratch. How would it learn which moves lead to checkmate and which lead to defeat? How would it understand the long-term consequences of capturing a pawn versus protecting its queen? This i...]]></description><link>https://blogs.yashpatel.xyz/demystifying-reinforcement-learning-a-beginners-guide-to-the-math</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/demystifying-reinforcement-learning-a-beginners-guide-to-the-math</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Tue, 04 Mar 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742818943874/9704e3e3-e8db-4c70-96c2-e61f0c17982d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Imagine teaching a computer to play chess from scratch. How would it learn which moves lead to checkmate and which lead to defeat? How would it understand the long-term consequences of capturing a pawn versus protecting its queen? This is the essence of reinforcement learning (RL), where agents learn optimal decision-making through interaction with their environment.</p>
<p>In this comprehensive blog, we'll explore Chapter 3 of Sutton and Barto's "Reinforcement Learning: An Introduction," which lays the mathematical foundation for understanding how RL agents learn. Whether you're a computer science student or an AI enthusiast, this guide will help you grasp these concepts through detailed explanations and our running chess analogy.</p>
<h2 id="heading-1-the-agent-environment-interface">1. The Agent-Environment Interface</h2>
<p>At the heart of reinforcement learning is the interaction between an agent (our chess-playing AI) and its environment (the chess game). This interaction happens in discrete time steps and follows a specific pattern:</p>
<ol>
<li><p>The agent observes the current state (S₁) - in chess, this is the current board position</p>
</li>
<li><p>Based on this state, the agent selects an action (A₁) - a specific chess move</p>
</li>
<li><p>The environment responds with:</p>
<ul>
<li><p>A new state (S₂) - the new board position after both players move</p>
</li>
<li><p>A reward (R₂) - perhaps +1 for capturing a piece, -1 for losing one, or +100 for checkmate</p>
</li>
</ul>
</li>
</ol>
<p>This cycle continues, creating a sequence: S₁, A₁, R₂, S₂, A₂, R₃, S₃, ... and so on.</p>
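<p>The interaction loop is easy to sketch in code. The environment below is a stand-in, not a chess engine: its states, transitions, and rewards are invented so that the S, A, R bookkeeping is visible.</p>
<pre><code class="lang-python">import random

class ToyEnv:
    """Stand-in environment: 3 placeholder states, terminates after 5 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial state S1

    def step(self, action):
        self.t += 1
        next_state = (action + self.t) % 3    # placeholder transition rule
        reward = 1 if next_state == 2 else 0  # placeholder reward signal
        done = self.t == 5
        return next_state, reward, done

env = ToyEnv()
state = env.reset()
trajectory = []  # records S1, A1, R2, S2, A2, R3, ...
done = False
while not done:
    action = random.choice([0, 1])            # a random policy, for illustration
    next_state, reward, done = env.step(action)
    trajectory.extend([state, action, reward])
    state = next_state

print(trajectory)
</code></pre>
<p>Every algorithm discussed later consumes exactly this kind of trajectory; only the rule for choosing actions changes.</p>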
<h3 id="heading-the-mathematical-framework-markov-decision-processes-mdps">The Mathematical Framework: Markov Decision Processes (MDPs)</h3>
<p>This interaction is formalized as a Markov Decision Process (MDP), which has five key components:</p>
<ol>
<li><p>A set of states (S) - all possible chess board configurations</p>
</li>
<li><p>A set of actions (A) - all legal chess moves</p>
</li>
<li><p>A set of rewards (R) - numerical feedback values</p>
</li>
<li><p>State transition probabilities - how likely each new state is, given the current state and action</p>
</li>
<li><p>Reward probabilities - how likely each reward is, given the state and action</p>
</li>
</ol>
<p>In a finite MDP, these sets contain a finite number of elements, making the problem mathematically tractable.</p>
<h3 id="heading-the-dynamics-function">The Dynamics Function</h3>
<p>The environment's behaviour is completely described by the dynamics function:</p>
<p>$$p(s', r | s, a) = \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$</p><p>This function gives the probability of transitioning to state s' and receiving reward r, given that we were in state s and took action a.</p>
<p>Let's break this down with our chess example:</p>
<ul>
<li><p>s: Current board position with your knight threatened by an opponent's pawn</p>
</li>
<li><p>a: Move your knight to safety</p>
</li>
<li><p>s': New board position after your move and your opponent's response</p>
</li>
<li><p>r: Reward (perhaps +0.5 for saving your piece)</p>
</li>
</ul>
<p>The dynamics function tells us the probability of ending up in state s' with reward r after taking action a in state s.</p>
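<p>To make this concrete, here is a toy dynamics function written as a Python dictionary (the states, rewards, and probabilities are invented for illustration). Marginalizing out the reward recovers the state-transition probabilities p(s'|s, a), and averaging rewards by their probabilities gives the expected reward r(s, a), just as in the derived quantities discussed below.</p>
<pre><code class="lang-python"># Toy dynamics: p[(s, a)] is a list of (next_state, reward, probability).
# States, actions, and numbers are invented purely for illustration.
p = {
    ("safe", "advance"): [("winning", 0.5, 0.7), ("losing", -0.5, 0.3)],
    ("safe", "retreat"): [("safe", 0.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): sum of p(s', r | s, a) over all rewards r."""
    return sum(prob for ns, r, prob in p[(s, a)] if ns == s_next)

def expected_reward(s, a):
    """r(s,a): sum of r * p(s', r | s, a) over all s' and r."""
    return sum(r * prob for ns, r, prob in p[(s, a)])

print(transition_prob("safe", "advance", "winning"))  # 0.7
print(expected_reward("safe", "advance"))             # 0.5*0.7 - 0.5*0.3 ≈ 0.2
</code></pre>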
<h3 id="heading-the-markov-property">The Markov Property</h3>
<p>A critical aspect of MDPs is the Markov property, which states that the future depends only on the present state and action, not on the history of how we got there. In chess terms, the best move depends only on the current board position, not on the sequence of moves that created it.</p>
<p>This might seem counterintuitive at first - doesn't chess strategy depend on understanding your opponent's past moves? The key insight is that if the "state" is properly defined to include all relevant information (perhaps including a model of your opponent's strategy based on their past moves), then the Markov property holds.</p>
<h3 id="heading-derived-quantities-from-the-dynamics-function">Derived Quantities from the Dynamics Function</h3>
<p>From the dynamics function, we can derive other useful quantities:</p>
<ol>
<li>State-transition probabilities:</li>
</ol>
<p>$$p(s' | s, a) = \sum_r p(s', r | s, a)$$</p><ol start="2">
<li>Expected rewards for state-action pairs:</li>
</ol>
<p>$$r(s, a) = \sum_r r \sum_{s'} p(s', r | s, a)$$</p><ol start="3">
<li>Expected rewards for state-action-next-state triples:</li>
</ol>
<p>$$r(s, a, s') = \sum_r r \left[ \frac{p(s', r | s, a)}{p(s' | s, a)} \right]$$</p><h3 id="heading-the-agent-environment-boundary">The Agent-Environment Boundary</h3>
<p>An important conceptual point is where to draw the line between agent and environment. In our chess example, the agent is the decision-making algorithm, while the environment includes the chess board, pieces, rules, and opponent.</p>
<p>The general principle is: anything the agent cannot arbitrarily change is part of the environment. The agent's knowledge about the environment is separate from this boundary - an agent might know everything about chess rules but still face a challenging learning problem.</p>
<h2 id="heading-2-goals-and-rewards">2. Goals and Rewards</h2>
<p>In reinforcement learning, the agent's goal is formalized through rewards - numerical values that the agent receives after each action. The fundamental principle is the reward hypothesis:</p>
<blockquote>
<p>All goals can be described as the maximization of expected cumulative reward.</p>
</blockquote>
<h3 id="heading-designing-reward-signals">Designing Reward Signals</h3>
<p>Designing an effective reward signal is crucial but tricky. For our chess AI:</p>
<ul>
<li><p>Too sparse: Only +1 for winning, -1 for losing, 0 otherwise. This makes learning slow because feedback is delayed until the end of a long game.</p>
</li>
<li><p>Too frequent/misleading: +1 for each piece captured might encourage the AI to sacrifice important pieces to capture pawns.</p>
</li>
<li><p>Well-designed: Perhaps +0.1 for controlling center squares, +0.5 for checking the opponent, +1 for capturing pieces (weighted by value), and +100 for checkmate.</p>
</li>
</ul>
<p>The key principle is that rewards communicate WHAT to achieve, not HOW to achieve it. If we reward subgoals too strongly, the agent might optimize for those at the expense of the true goal.</p>
<p>For example, if we heavily reward capturing pieces, our chess AI might sacrifice positional advantage just to capture a pawn. Instead, the reward should reflect the true objective - winning the game - with smaller rewards for actions that generally contribute to that goal.</p>
<h3 id="heading-examples-of-reward-design">Examples of Reward Design</h3>
<ol>
<li><p><strong>Chess AI</strong>: +100 for checkmate, -100 for being checkmated, +piece_value for captures, -piece_value for losses, small rewards for controlling important squares.</p>
</li>
<li><p><strong>Robot Navigation</strong>: -1 for each time step (to encourage efficiency), -10 for collisions, +100 for reaching the goal.</p>
</li>
<li><p><strong>Stock Trading Agent</strong>: Reward proportional to portfolio value increase, with perhaps small penalties for excessive trading (to discourage churn).</p>
</li>
</ol>
<p>The art of reward design involves balancing immediate feedback (to guide learning) with alignment to the true objective (to ensure the right behavior is learned).</p>
<h2 id="heading-3-returns-and-episodes">3. Returns and Episodes</h2>
<p>Now that we've defined states, actions, and rewards, we need to formalize the agent's objective: maximizing the expected return. The return is the function of future rewards that the agent aims to maximize.</p>
<h3 id="heading-episodic-tasks">Episodic Tasks</h3>
<p>In episodic tasks, the agent-environment interaction naturally breaks into distinct episodes with clear endpoints. A chess game is a perfect example - each game starts from the initial board position and ends with checkmate, stalemate, or resignation.</p>
<p>For episodic tasks, we define the return as the sum of rewards from the current time step until the end of the episode:</p>
<p>$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$</p><p>Where T is the final time step of the episode.</p>
<p>For our chess AI, this would be the sum of all rewards received from the current position until the game ends.</p>
<h3 id="heading-continuing-tasks">Continuing Tasks</h3>
<p>In continuing tasks, the interaction continues without a natural endpoint. Examples include ongoing process control or a robot that operates continuously.</p>
<p>For these tasks, summing rewards could lead to infinite returns, making it impossible to compare policies. To solve this, we introduce discounting:</p>
<p>$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$</p><p>Where γ is the discount rate (0 ≤ γ ≤ 1).</p>
<h3 id="heading-the-discount-rate">The Discount Rate</h3>
<p>The discount rate γ determines how much the agent values future rewards:</p>
<ul>
<li><p>γ = 0: "Myopic" agent that only cares about immediate rewards</p>
</li>
<li><p>γ close to 0: Agent values near-term rewards much more than long-term rewards</p>
</li>
<li><p>γ close to 1: Agent values future rewards almost as much as immediate ones</p>
</li>
</ul>
<p>In chess, a low γ might lead to an agent that greedily captures pieces without considering long-term positional disadvantages. A high γ would encourage strategic play, where the agent might sacrifice material for long-term advantage.</p>
<h3 id="heading-recursive-relationship-of-returns">Recursive Relationship of Returns</h3>
<p>Returns at successive time steps are related recursively:</p>
<p>$$G_t = R_{t+1} + \gamma G_{t+1}$$</p><p>This recursive relationship is fundamental to many RL algorithms and helps us understand how value propagates backward from future states to current states.</p>
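<p>Both formulas are easy to verify numerically. The sketch below computes G₀ for an invented five-step reward sequence directly from the discounted sum, and again by folding the rewards backward through the recursion; the two results agree.</p>
<pre><code class="lang-python">gamma = 0.9
rewards = [1, 0, -1, 2, 5]  # R1..R5 for an invented five-step episode

# Direct definition: G_0 = sum over k of gamma^k * R_{k+1}
g_direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g_direct, g)  # both ≈ 4.9285
</code></pre>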
<h2 id="heading-4-unified-notation-for-episodic-and-continuing-tasks">4. Unified Notation for Episodic and Continuing Tasks</h2>
<p>To handle both episodic and continuing tasks with a single notation, we use a clever mathematical trick: we treat episode termination as entering a special absorbing state that transitions only to itself and generates only rewards of zero.</p>
<p>This allows us to use the discounted return formula for both types of tasks:</p>
<p>$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$</p><p>For episodic tasks, we can either:</p>
<ol>
<li><p>Set γ = 1 and rely on the episode's finite length</p>
</li>
<li><p>Use γ &lt; 1 and treat the terminal state as absorbing with zero rewards</p>
</li>
</ol>
<p>This unified notation simplifies our mathematical framework and allows us to develop algorithms that work for both types of tasks.</p>
<h2 id="heading-5-policies-and-value-functions">5. Policies and Value Functions</h2>
<h3 id="heading-policies">Policies</h3>
<p>A policy (π) defines the agent's behavior - it's the strategy that maps states to actions. In reinforcement learning, we typically use stochastic policies, where π(a|s) gives the probability of taking action a in state s.</p>
<p>For our chess AI, the policy would specify the probability of making each possible move in any given board position. A deterministic policy would always choose the same move in the same position, while a stochastic policy might sometimes explore different options.</p>
<h3 id="heading-value-functions">Value Functions</h3>
<p>Value functions estimate how good it is for the agent to be in a given state (or to take a given action in a given state). They're defined in terms of expected future returns.</p>
<h3 id="heading-state-value-function">State-Value Function</h3>
<p>The state-value function for policy π is defined as:</p>
<p>$$v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s\right]$$</p><p>This gives the expected return when starting in state s and following policy π thereafter.</p>
<p>In chess terms, vπ(s) would tell us the expected outcome (in terms of our reward metric) when starting from board position s and playing according to strategy π.</p>
<h3 id="heading-action-value-function">Action-Value Function</h3>
<p>The action-value function for policy π is defined as:</p>
<p>$$q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s, A_t = a\right]$$</p><p>This gives the expected return when starting in state s, taking action a, and thereafter following policy π.</p>
<p>In chess, qπ(s, a) would tell us the expected outcome when making a specific move a from position s, and then continuing with strategy π.</p>
<h3 id="heading-the-bellman-equation">The Bellman Equation</h3>
<p>A fundamental property of value functions is that they satisfy recursive relationships known as Bellman equations. For the state-value function:</p>
<p>$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$</p><p>This equation expresses a relationship between the value of a state and the values of its successor states. It's saying that the value of the current state equals the expected immediate reward plus the discounted value of the next state.</p>
<p>In chess terms, the value of a position equals the immediate benefit of your next move (capturing a piece, controlling the center, etc.) plus the discounted value of the resulting position.</p>
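<p>The Bellman equation also doubles as an algorithm: start with arbitrary values and repeatedly apply its right-hand side as an update (iterative policy evaluation). Here is a minimal sketch on an invented two-state MDP with a uniform random policy; all states, rewards, and probabilities are made up for illustration.</p>
<pre><code class="lang-python">gamma = 0.9
states = ["A", "B"]
actions = ["x", "y"]
# dynamics[(s, a)] = list of (next_state, reward, probability); invented numbers
dynamics = {
    ("A", "x"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "y"): [("B", 0.0, 1.0)],
    ("B", "x"): [("A", 2.0, 1.0)],
    ("B", "y"): [("B", -1.0, 1.0)],
}
pi = 0.5  # uniform random policy: pi(a|s) = 0.5 for both actions

v = {s: 0.0 for s in states}
for _ in range(500):  # enough sweeps for the contraction to converge
    v = {
        s: sum(
            pi * sum(prob * (r + gamma * v[ns]) for ns, r, prob in dynamics[(s, a)])
            for a in actions
        )
        for s in states
    }

print(v)  # v_pi for each state under the random policy
</code></pre>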
<h3 id="heading-backup-diagrams">Backup Diagrams</h3>
<p>Backup diagrams visually represent these recursive relationships. They show how value information "backs up" from future states to the current state.</p>
<p>For the state-value function, the backup diagram shows:</p>
<ol>
<li><p>The current state at the top</p>
</li>
<li><p>Actions available from that state</p>
</li>
<li><p>Possible next states resulting from each action</p>
</li>
<li><p>The value backing up from those next states to the current state</p>
</li>
</ol>
<p>These diagrams help visualise the flow of value information in reinforcement learning algorithms.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742819292558/4961a13d-02f1-43e8-862d-5ef49ff7c93b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-6-optimal-policies-and-optimal-value-functions">6. Optimal Policies and Optimal Value Functions</h2>
<p>The goal in reinforcement learning is to find an optimal policy - one that achieves the maximum expected return from all states. This leads us to define optimal value functions.</p>
<h3 id="heading-optimal-state-value-function">Optimal State-Value Function</h3>
<p>The optimal state-value function, v*(s), gives the maximum value achievable by any policy for each state s:</p>
<p>$$v_*(s) = \max_\pi v_\pi(s)$$</p><p>In chess, v*(s) would tell us the expected outcome when playing optimally from position s.</p>
<h3 id="heading-optimal-action-value-function">Optimal Action-Value Function</h3>
<p>Similarly, the optimal action-value function, q*(s, a), gives the maximum expected return for each state-action pair:</p>
<p>$$q_*(s, a) = \max_\pi q_\pi(s, a)$$</p><p>In chess, q*(s, a) would tell us the expected outcome when making move a from position s and then playing optimally thereafter.</p>
<h3 id="heading-bellman-optimality-equations">Bellman Optimality Equations</h3>
<p>The optimal value functions satisfy special Bellman equations called Bellman optimality equations:</p>
<p>$$v_*(s) = \max_a \sum_{s', r} p(s', r|s, a)[r + \gamma v_*(s')]$$</p><p>$$q_*(s, a) = \sum_{s', r} p(s', r|s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$</p><p>These equations express the principle that the value of a state under an optimal policy equals the expected return for the best action from that state.</p>
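<p>Replacing the expectation over a policy with a max over actions turns the Bellman optimality equation into an update rule, value iteration. Below is a minimal sketch on an invented two-state MDP (all numbers are made up for illustration), including greedy policy extraction at the end.</p>
<pre><code class="lang-python">gamma = 0.9
states = ["A", "B"]
actions = ["x", "y"]
# dynamics[(s, a)] = list of (next_state, reward, probability); invented numbers
dynamics = {
    ("A", "x"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "y"): [("B", 0.0, 1.0)],
    ("B", "x"): [("A", 2.0, 1.0)],
    ("B", "y"): [("B", -1.0, 1.0)],
}

def backup(s, a, v):
    """Expected one-step return: sum of p(s',r|s,a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[ns]) for ns, r, prob in dynamics[(s, a)])

v = {s: 0.0 for s in states}
for _ in range(500):  # value iteration sweeps
    v = {s: max(backup(s, a, v) for a in actions) for s in states}

# Greedy policy extraction: in each state, pick the action achieving the max
policy = {s: max(actions, key=lambda a: backup(s, a, v)) for s in states}
print(v, policy)
</code></pre>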
<h3 id="heading-finding-optimal-policies">Finding Optimal Policies</h3>
<p>Once we have the optimal value functions, finding an optimal policy is straightforward:</p>
<ul>
<li><p>With v*: Choose actions that maximize the expected value of the next state</p>
</li>
<li><p>With q*: Simply choose the action with the highest q*(s, a) in each state</p>
</li>
</ul>
<p>This is why q* is particularly useful - it directly tells us which actions are best in each state without requiring additional computation.</p>
<h3 id="heading-greedy-policies">Greedy Policies</h3>
<p>A policy that always selects the action with the highest estimated value is called a greedy policy. When we have the true optimal value function v*, a greedy policy with respect to v* is guaranteed to be optimal.</p>
<p>In chess, this would mean always making the move that leads to the position with the highest v* value.</p>
<h2 id="heading-7-optimality-and-approximation">7. Optimality and Approximation</h2>
<p>While we've defined optimal policies and value functions mathematically, finding them exactly is often computationally infeasible for real-world problems.</p>
<h3 id="heading-computational-challenges">Computational Challenges</h3>
<p>For many interesting problems, the state space is enormous:</p>
<ul>
<li><p>Chess has approximately 10^43 legal positions</p>
</li>
<li><p>Go has approximately 10^170</p>
</li>
<li><p>Real-world robotics problems have continuous state spaces with infinite states</p>
</li>
</ul>
<p>Even with today's computing power, we cannot solve the Bellman optimality equations exactly for such large problems.</p>
<h3 id="heading-the-role-of-approximation">The Role of Approximation</h3>
<p>In practice, reinforcement learning relies on approximation methods:</p>
<ol>
<li><p>Function approximation: Using parameterized functions (like neural networks) to represent value functions</p>
</li>
<li><p>Sample-based learning: Learning from experience rather than complete knowledge of the environment</p>
</li>
<li><p>Temporal difference learning: Bootstrapping estimates from other estimates</p>
</li>
</ol>
<h3 id="heading-focusing-on-important-states">Focusing on Important States</h3>
<p>A key advantage of reinforcement learning is that it can focus computational resources on states that the agent actually encounters, rather than trying to learn optimal behavior for all possible states.</p>
<p>In chess, professional players don't memorize optimal moves for all 10^43 positions - they focus on positions that commonly arise in games. Similarly, reinforcement learning algorithms can prioritize learning about frequently encountered states.</p>
<h3 id="heading-balancing-exploration-and-exploitation">Balancing Exploration and Exploitation</h3>
<p>A fundamental challenge in reinforcement learning is the exploration-exploitation dilemma:</p>
<ul>
<li><p>Exploitation: Using current knowledge to maximize rewards</p>
</li>
<li><p>Exploration: Trying new actions to discover potentially better strategies</p>
</li>
</ul>
<p>In chess, this would be like deciding whether to play a familiar opening (exploitation) or try a new one (exploration).</p>
]]></content:encoded></item><item><title><![CDATA[Exploration vs. Exploitation: A Deep Dive into Multi-armed Bandits]]></title><description><![CDATA[A k-armed Bandit Problem
Imagine you're at a casino, faced with a row of slot machines (one-armed bandits), each with its own hidden probability of paying out. Your goal is to maximize your winnings over the night, but you don't know which machines h...]]></description><link>https://blogs.yashpatel.xyz/exploration-vs-exploitation-a-deep-dive-into-multi-armed-bandits</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/exploration-vs-exploitation-a-deep-dive-into-multi-armed-bandits</guid><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 19 Feb 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634237092/3b2b0aa8-1c6c-4dd1-b565-81dfbe56e2af.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-a-k-armed-bandit-problem">A k-armed Bandit Problem</h2>
<p>Imagine you're at a casino, faced with a row of slot machines (one-armed bandits), each with its own hidden probability of paying out. Your goal is to maximize your winnings over the night, but you don't know which machines have the best odds. This is essentially the k-armed bandit problem - a fundamental challenge in reinforcement learning that elegantly captures the exploration-exploitation dilemma.</p>
<p>In mathematical terms, we have k different actions (pulling different slot machine arms), and each action a has an expected reward, which we call its value q*(a). When we select an action At at time step t, we receive a reward Rt drawn from a probability distribution dependent on the selected action:</p>
<p>$$q^*(a) \doteq \mathbb{E}[R_t | A_t = a]$$</p><p>The catch? We don't know these values in advance. We can only estimate them based on our experience, and these estimates are denoted as Qt(a).</p>
<p>This creates our central dilemma: should we exploit our current knowledge by selecting what appears to be the best arm based on our limited experience (the "greedy" action), or should we explore other arms to potentially discover better options? If we always exploit, we might get stuck with a suboptimal arm. If we always explore, we'll learn a lot but might not maximize our rewards.</p>
<p>This isn't just a theoretical problem. A doctor choosing treatments for patients faces this exact dilemma - should they stick with the treatment that has worked best so far (exploit) or try a promising new alternative (explore)? The stakes in such real-world scenarios are often much higher than casino winnings.</p>
<h2 id="heading-action-value-methods">Action-value Methods</h2>
<p>To tackle the k-armed bandit problem, we need ways to estimate the value of each action and strategies to select actions based on these estimates.</p>
<p>The most natural approach to estimating action values is to average the rewards we've received when taking that action:</p>
<p>$$Q_t(a) \doteq \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$</p><p>Where the indicator 𝟙 equals 1 when its predicate is true and 0 otherwise. This is called the sample-average method, and by the law of large numbers, as we take action a more and more times, Qt(a) will converge to the true value q*(a).</p>
<p>Once we have these estimates, how do we select actions? The simplest approach is the greedy method:</p>
<p>$$A_t \doteq \underset{a}{\operatorname{argmax}} Q_t(a)$$</p><p>This always selects the action with the highest estimated value. But as we've discussed, pure exploitation can lead to suboptimal results.</p>
<p>A simple but effective alternative is the ε-greedy method: with probability ε, we select a random action (exploration), and with probability 1-ε, we select the greedy action (exploitation). This ensures that all actions will eventually be tried enough times for their estimates to converge to their true values.</p>
<p>For example, if we have two actions and set ε=0.5, there's a 50% chance we explore randomly (25% chance for each action) and a 50% chance we exploit. So the greedy action has a 50% + 25% = 75% chance of being selected, while the non-greedy action has only a 25% chance.</p>
<h2 id="heading-the-10-armed-testbed">The 10-armed Testbed</h2>
<p>To evaluate these methods empirically, Sutton and Barto created a testbed with 2,000 randomly generated k-armed bandit problems where k=10. For each problem, the true action values q*(a) were selected from a normal distribution with mean 0 and variance 1. When an action was selected, the actual reward was drawn from a normal distribution with mean q*(a) and variance 1.</p>
<p>Let me walk you through what this means in practice. Imagine one specific bandit problem in this testbed. The true values of the 10 arms might be something like:<br />[-0.7, 0.5, 1.2, -0.3, 0.8, -1.5, 0.2, 1.5, -0.9, 0.1]</p>
<p>The best arm here is arm 8 with a value of 1.5, but we don't know this in advance. When we pull this arm, we don't get exactly 1.5 as a reward - we get a random value from a normal distribution centered at 1.5. Sometimes we might get 2.3, other times 0.7, occasionally even negative values, though they're less likely.</p>
<p>The results from running different methods on this testbed were revealing:</p>
<ol>
<li><p>The greedy method (ε=0) improved quickly at first but often got stuck on suboptimal actions, achieving an average reward of only about 1 out of a possible 1.55.</p>
</li>
<li><p>The ε-greedy methods performed better in the long run because they continued to explore. With ε=0.1, the method explored more and usually found the optimal action earlier but never selected it more than 91% of the time. With ε=0.01, it improved more slowly but eventually performed better.</p>
</li>
</ol>
<p>These results highlight a crucial insight: the value of exploration depends on the specific problem. If rewards were more variable (higher variance), exploration would be even more beneficial. If rewards were deterministic, a greedy approach might work better. And in nonstationary environments, where the best action changes over time, ongoing exploration becomes essential.</p>
<h2 id="heading-incremental-implementation">Incremental Implementation</h2>
<p>When implementing these methods, we need to efficiently compute the action-value estimates. The naive approach would store all rewards and recompute the average each time, but this would require increasing memory and computation as we gather more experience.</p>
<p>Fortunately, we can use an incremental update formula. If Qn is our estimate after n-1 rewards, and we receive a new reward Rn, we can update our estimate as:</p>
<p>$$Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n]$$</p><p>This elegant formula requires storing only two values per action (the current estimate Qn and the count n) and performs a constant amount of computation per step.</p>
<p>This formula follows a general pattern that appears throughout reinforcement learning:</p>
<p>$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$</p><p>The term [Target - OldEstimate] represents an error in our estimate, and we move our estimate toward the target by a fraction determined by the step size.</p>
<p>Here's a complete algorithm for the ε-greedy bandit method with incremental updates:</p>
<pre><code class="lang-plaintext">Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Loop forever:
    A ← {
        argmax_a Q(a)  with probability 1-ε
        a random action  with probability ε
    }
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + 1/N(A) * [R - Q(A)]
</code></pre>
<p>This algorithm efficiently implements the ε-greedy approach with sample-average estimation, requiring minimal memory and computation.</p>
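<p>The pseudocode translates almost line for line into Python. To keep the sketch self-contained, the bandit itself is simulated with invented true action values, and the explore/exploit coin flip is drawn with probabilities ε and 1 − ε.</p>
<pre><code class="lang-python">import random

random.seed(0)
q_true = [0.2, -0.5, 1.0, 0.1]  # invented true action values q*(a)
k = len(q_true)

def bandit(a):
    """Simulated arm: reward is q*(a) plus unit-variance Gaussian noise."""
    return random.gauss(q_true[a], 1.0)

eps = 0.1
Q = [0.0] * k  # value estimates
N = [0] * k    # pull counts
for _ in range(5000):
    explore = random.choices((True, False), weights=(eps, 1 - eps))[0]
    if explore:
        A = random.randrange(k)                 # random action
    else:
        A = max(range(k), key=lambda a: Q[a])   # greedy action
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                   # incremental sample average

print(max(range(k), key=lambda a: Q[a]))        # arm the agent believes is best
</code></pre>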
<h2 id="heading-tracking-a-non-stationary-problem">Tracking a Non-stationary Problem</h2>
<p>In many real-world scenarios, the true values of actions change over time - what was the best action yesterday might not be the best today. Think of our casino example: what if the slot machines were being adjusted throughout the night, changing their payout probabilities?</p>
<p>The sample-average method we've discussed gives equal weight to all rewards, which isn't ideal for tracking changing values. Instead, we can use a constant step-size parameter α:</p>
<p>$$Q_{n+1} \doteq Q_n + \alpha [R_n - Q_n]$$</p><p>This results in Qt being a weighted average of past rewards, with more recent rewards given more weight:</p>
<p>$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$</p><p>The weight given to reward Ri is α(1-α)^(n-i), which decreases exponentially as the reward gets older. This is called an exponential recency-weighted average.</p>
<p>For example, with α=0.1, the weight of the most recent reward is 0.1, the previous reward gets 0.09, the one before that 0.081, and so on. This allows our estimates to adapt to changing values.</p>
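<p>These weights are easy to verify numerically, including the fact that, together with the residual weight (1-α)^n left on the initial estimate Q1, they sum to 1.</p>
<pre><code class="lang-python">alpha = 0.1
n = 10  # number of rewards seen so far

# Weight on reward R_i is alpha * (1 - alpha)^(n - i), for i = 1..n
weights = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
initial_weight = (1 - alpha) ** n  # weight remaining on the initial estimate Q_1

print(weights[-3:])                   # most recent three: ≈ 0.081, 0.09, 0.1
print(initial_weight + sum(weights))  # ≈ 1.0
</code></pre>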
<p>While constant step sizes are great for nonstationary problems, they don't guarantee convergence to the true action values in stationary problems. For convergence, step-size parameters αn(a) must satisfy:</p>
<p>$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty$$</p><p>$$\sum_{n=1}^{\infty} \alpha_n^2(a) &lt; \infty$$</p><p>The sample-average method with αn(a)=1/n satisfies these conditions, but constant step sizes don't. However, in practice, the ability to track nonstationary problems often outweighs the theoretical guarantee of convergence in stationary ones.</p>
<h2 id="heading-optimistic-initial-values">Optimistic Initial Values</h2>
<p>Another approach to encouraging exploration is through optimistic initial values. Instead of initializing our action-value estimates to zero or some neutral value, we set them to be optimistically high - higher than we expect any action's true value to be.</p>
<p>When we select actions based on these optimistic estimates, we'll inevitably be "disappointed" by the actual rewards, which will be lower than our initial estimates. This disappointment drives exploration: as we update our estimates downward for the actions we've tried, untried actions (which still have their optimistic initial values) become relatively more attractive.</p>
<p>For example, in the 10-armed testbed where true action values are normally distributed with mean 0, setting initial estimates to +5 is wildly optimistic. This causes the algorithm to try all actions several times before settling into more exploitative behavior.</p>
<p>The beauty of this approach is that it drives exploration without requiring random action selection. A purely greedy algorithm with optimistic initialization will explore extensively in its early stages.</p>
<p>However, optimistic initialization has a significant limitation: its exploration is inherently temporary. Once all actions have been tried enough to bring their estimates down from the initial optimistic values, the method becomes purely exploitative. If the environment changes later, creating a renewed need for exploration, optimistic initialization can't help.</p>
<h2 id="heading-upper-confidence-bound-action-selection">Upper-Confidence-Bound Action Selection</h2>
<p>The Upper-Confidence-Bound (UCB) algorithm takes a more sophisticated approach to the exploration-exploitation trade-off. It selects actions according to:</p>
<p>$$A_t \doteq \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$</p><p>Where Nt(a) is the number of times action a has been selected prior to time t, and c &gt; 0 controls the degree of exploration.</p>
<p>This formula balances two factors:</p>
<ol>
<li><p>The estimated value Qt(a) (exploitation)</p>
</li>
<li><p>A measure of uncertainty √(ln t / Nt(a)) (exploration)</p>
</li>
</ol>
<p>The uncertainty term increases when an action hasn't been selected for a while (as t increases but Nt(a) doesn't) and decreases when an action is selected (as Nt(a) increases). This naturally drives exploration toward actions with promising values or high uncertainty.</p>
<p>In our casino analogy, UCB is like a strategic gambler who not only considers which machines have paid well so far but also which ones haven't been tried enough to be confident about their payouts.</p>
<p>UCB often performs well in practice, outperforming ε-greedy methods on many problems. However, it's more complex to extend beyond simple bandit problems to the full reinforcement learning setting, particularly for nonstationary problems or large state spaces.</p>
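<p>As an illustrative sketch (again my own, with an arbitrary c and seed), the UCB rule can be run on the same kind of testbed. Untried actions are treated as maximizing, so each arm is sampled once before the bound kicks in:</p>

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).
    An action with N_t(a) == 0 is treated as maximally uncertain."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

rng = np.random.default_rng(1)
k = 10
true_values = rng.normal(0.0, 1.0, k)
Q, N = np.zeros(k), np.zeros(k)

for t in range(1, 1001):
    a = ucb_action(Q, N, t)
    r = rng.normal(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update
```

<p>Notice how the bonus term shrinks for frequently pulled arms and grows (via ln t) for neglected ones, which is exactly the uncertainty-driven exploration described above.</p>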
<h2 id="heading-gradient-bandit-algorithms">Gradient Bandit Algorithms</h2>
<p>Gradient bandit algorithms take a different approach. Instead of estimating action values directly, they learn a preference Ht(a) for each action. Higher preferences make an action more likely to be selected, but preferences don't have any interpretation in terms of reward.</p>
<p>Actions are selected according to a soft-max (Boltzmann) distribution:</p>
<p>$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$</p><p>Initially, all preferences are equal (e.g., Ht(a)=0 for all a), so all actions have an equal probability of being selected.</p>
<p>After selecting action At and receiving reward Rt, the preferences are updated as:</p>
<p>$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha (R_t - \bar{R}_t) (1 - \pi_t(A_t))$$</p><p>$$H_{t+1}(a) \doteq H_t(a) - \alpha (R_t - \bar{R}_t) \pi_t(a), \quad \text{for all } a \neq A_t$$</p><p>where α &gt; 0 is a step-size parameter, and R̄t is the average of all rewards up to time t, serving as a baseline.</p>
<p>This update rule increases the preference for the selected action if the reward was higher than the baseline and decreases it if the reward was lower. Non-selected actions move in the opposite direction.</p>
<p>What's fascinating about this algorithm is that it can be derived as a stochastic gradient ascent method, maximizing the expected reward. The derivation involves calculus but shows that the algorithm is moving the action preferences in the direction that increases expected reward.</p>
<p>In our casino example, the gradient bandit algorithm would be like a gambler who develops intuitive preferences for different machines rather than trying to estimate their exact payout rates. After each play, they adjust their preferences based on whether the reward was better or worse than what they've been getting on average.</p>
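<p>The two update equations vectorize neatly. The sketch below (my own; α, the seed, and the incremental baseline are illustrative choices) subtracts α(Rt − R̄t)πt(a) from every preference and then adds α(Rt − R̄t) back to the selected action, which reproduces the (1 − πt(At)) factor for At:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
k = 10
true_values = rng.normal(0.0, 1.0, k)

H = np.zeros(k)   # action preferences, all equal initially
baseline = 0.0    # running average of rewards (the R̄_t baseline)
alpha = 0.1

for t in range(1, 1001):
    pi = np.exp(H - H.max())  # soft-max, shifted for numerical stability
    pi /= pi.sum()
    a = rng.choice(k, p=pi)
    r = rng.normal(true_values[a], 1.0)
    baseline += (r - baseline) / t  # incremental average of all rewards

    # H[a] gets +alpha*(r-baseline)*(1-pi[a]); all others get -alpha*(r-baseline)*pi
    H -= alpha * (r - baseline) * pi
    H[a] += alpha * (r - baseline)
```

<p>Since the same constant is added to or subtracted from preferences only through the soft-max, only the differences between preferences matter, consistent with preferences having no direct reward interpretation.</p>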
<h2 id="heading-associative-search-contextual-bandits">Associative Search (Contextual Bandits)</h2>
<p>So far, we've considered nonassociative tasks where there's a single best action (or a best action that changes over time). But in many real-world problems, the best action depends on the situation or context.</p>
<p>Imagine if our casino had multiple slot machines, but their payout probabilities changed based on a visible signal - perhaps a color displayed on the machine. This is an associative search task, also called a contextual bandit problem. We need to learn not just which action is best overall, but which action is best in each context.</p>
<p>For example, if the machine shows red, arm 1 might be best; if it shows green, arm 2 might be best. By learning these associations, we can perform much better than if we treated all situations as the same.</p>
<p>Associative search tasks bridge the gap between the simple k-armed bandit problem and the full reinforcement learning problem. They involve learning a policy (a mapping from situations to actions), but like bandits, each action only affects the immediate reward, not future situations.</p>
<p>Consider a concrete example: suppose you face a 2-armed bandit where the true values randomly switch between two cases:</p>
<ul>
<li><p>Case A: action 1 has value 10, action 2 has value 20</p>
</li>
<li><p>Case B: action 1 has value 90, action 2 has value 80</p>
</li>
</ul>
<p>If you can't tell which case you're in, the best you can do is always select action 1, giving an expected reward of (10+90)/2 = 50 per step. But if you're told which case you're facing, you can select action 2 in case A and action 1 in case B, achieving an expected reward of (20+90)/2 = 55 per step.</p>
<p>This example illustrates how contextual information can significantly improve performance, allowing us to adapt our actions to the specific situation we're in.</p>
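<p>The arithmetic in the two-case example above can be checked in a few lines (a toy calculation, assuming both cases are equally likely):</p>

```python
# Case A: action values (10, 20); Case B: action values (90, 80); cases equally likely.
values = {"A": (10, 20), "B": (90, 80)}

# Blind policy: stick to whichever single action is best on average across cases.
blind = max(
    sum(values[c][a] for c in values) / len(values)  # average value of action a
    for a in (0, 1)
)

# Contextual policy: pick the best action within each case.
contextual = sum(max(values[c]) for c in values) / len(values)

print(blind, contextual)  # 50.0 55.0
```

<p>Either fixed action averages out to 50, while conditioning on the visible case lifts the expected reward to 55.</p>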
<h2 id="heading-conclusion">Conclusion</h2>
<p>The multi-armed bandit problem provides a rich framework for understanding the exploration-exploitation dilemma that lies at the heart of reinforcement learning. We've explored several approaches to balancing these competing needs:</p>
<ul>
<li><p>ε-greedy methods, which explicitly separate exploration and exploitation</p>
</li>
<li><p>Optimistic initialization, which drives exploration through initially high value estimates</p>
</li>
<li><p>Upper-Confidence-Bound algorithms, which balance exploitation with uncertainty-based exploration</p>
</li>
<li><p>Gradient bandit algorithms, which learn action preferences rather than value estimates</p>
</li>
<li><p>Contextual bandits, which extend these ideas to situation-dependent action selection</p>
</li>
</ul>
<p>Each method has its strengths and weaknesses, and their relative performance depends on the specific problem characteristics. UCB methods often perform best on stationary problems, while constant-α methods adapt better to nonstationary environments.</p>
<p>The exploration-exploitation dilemma extends far beyond bandit problems. It appears in various forms throughout reinforcement learning and is a fundamental challenge in any learning system that must make decisions based on limited information.</p>
<p>As we move from bandits to the full reinforcement learning problem, we'll see how these ideas extend to sequential decision-making, where actions affect not just immediate rewards but also future situations and opportunities. The methods we've explored here provide a foundation for understanding these more complex scenarios.</p>
<p>In our casino analogy, we're no longer just playing individual slot machines - we're navigating a complex casino where each decision affects not only our immediate winnings but also which games we'll have access to next. This is the full reinforcement learning problem, and it's the focus of the rest of our journey.</p>
]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning: First Principles]]></title><description><![CDATA[NoteThis blog is primarily based on my personal notes from Chapter 1 of Reinforcement Learning: An Introduction by Sutton and Barto, structured to improve

Hey there! Welcome to my blog series on reinforcement learning (RL). I'm currently reading Rei...]]></description><link>https://blogs.yashpatel.xyz/reinforcement-learning-first-principles</link><guid isPermaLink="true">https://blogs.yashpatel.xyz/reinforcement-learning-first-principles</guid><category><![CDATA[Reinforcement Learning]]></category><dc:creator><![CDATA[Yash Patel]]></dc:creator><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738942220129/e4e3c859-4712-4987-b818-45e41d2af027.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>Note</summary><div data-type="detailsContent">This blog is primarily based on my personal notes from Chapter 1 of Reinforcement Learning: An Introduction by Sutton and Barto, structured to improve</div></details>

<p>Hey there! Welcome to my blog series on reinforcement learning (RL). I'm currently reading <a target="_blank" href="http://incompleteideas.net/book/the-book-2nd.html"><em>Reinforcement Learning: An Introduction by Sutton and Barto</em></a>, and as I go through the book, I am making my own notes. I wanted to take this opportunity to document my learning process in the form of blog posts. My goal with this series is to explain RL concepts from my perspective and break them down in an easy-to-understand way. If you have a background equivalent to a BSc in Computer Science, you should be able to follow along comfortably.</p>
<p>That said, I want to be upfront about the fact that I am still learning. I'm no expert in this field, so if I make any mistakes along the way, I sincerely apologize in advance. If you spot any errors or have any suggestions, I would love to hear from you! Feel free to reach out to me via email—I would be more than happy to learn and improve.</p>
<p>As I am self-learning RL, I have also used AI tools to help explain topics that I initially found difficult to grasp. I believe in leveraging available resources to make learning more efficient. I will aim to publish one blog post per chapter of the book. However, due to my full-time job and my ongoing final-year university project, I can't promise a fixed schedule for when each post will be published. But rest assured, I am committed to completing this series, and I hope it will be a useful resource for anyone else looking to learn RL!</p>
<h2 id="heading-reinforcement-learning">Reinforcement Learning</h2>
<p>Reinforcement learning (RL) is a paradigm of learning where an agent interacts with its environment to discover optimal actions through a process of trial and error. Unlike supervised learning, where explicit instructions or labeled data guide the learning process, RL requires the agent to explore different actions, assess their consequences, and refine its strategy based on received rewards. One of the most defining aspects of RL is that rewards can be immediate or delayed, making long-term planning an essential component of effective learning.</p>
<h3 id="heading-key-characteristics-of-reinforcement-learning">Key Characteristics of Reinforcement Learning</h3>
<ul>
<li><p><strong>Trial and Error Learning</strong>: The agent learns by continuously experimenting with different actions and observing their consequences.</p>
</li>
<li><p><strong>Delayed Rewards</strong>: Unlike immediate feedback systems, RL often involves scenarios where an agent's actions influence future rewards, requiring it to balance short-term gains with long-term benefits.</p>
</li>
</ul>
<h3 id="heading-how-reinforcement-learning-is-viewed">How Reinforcement Learning is Viewed</h3>
<p>Reinforcement learning can be categorized in three primary ways:</p>
<ul>
<li><p><strong>As a problem to be solved</strong>: RL aims to determine the best decision-making strategy to maximize long-term rewards.</p>
</li>
<li><p><strong>As a collection of solution methods</strong>: Various computational approaches and algorithms exist to tackle RL problems, including value-based, policy-based, and model-free learning techniques.</p>
</li>
<li><p><strong>As a research field</strong>: RL is an active domain of research that seeks to refine algorithms, explore theoretical underpinnings, and expand its applications across multiple industries.</p>
</li>
</ul>
<h3 id="heading-formalizing-the-rl-problem">Formalizing the RL Problem</h3>
<p>The RL framework is formalized using concepts from dynamical systems theory, particularly as the optimal control of partially known <strong>Markov Decision Processes (MDPs)</strong>. MDPs serve as the mathematical foundation of RL, structuring the learning problem into key components:</p>
<ul>
<li><p><strong>Sensation (State Perception)</strong>: The agent must perceive its environment and identify the current state.</p>
</li>
<li><p><strong>Action (Decision Making)</strong>: The agent must decide on an action that influences the state of the environment.</p>
</li>
<li><p><strong>Goal (Optimization Objective)</strong>: The agent’s objective is to maximize cumulative rewards over time by making intelligent action choices.</p>
</li>
</ul>
<h3 id="heading-the-role-of-learning-agents">The Role of Learning Agents</h3>
<p>To function effectively in an RL framework, an agent must possess the ability to:</p>
<ul>
<li><p>Sense and interpret the state of the environment.</p>
</li>
<li><p>Take actions that actively influence the environment’s state.</p>
</li>
<li><p>Align its decision-making strategy with a predefined goal that ensures long-term success.</p>
</li>
</ul>
<p>Any approach that successfully enables an agent to navigate these challenges qualifies as a reinforcement learning method.</p>
<h3 id="heading-differences-between-supervised-unsupervised-and-reinforcement-learning">Differences Between Supervised, Unsupervised, and Reinforcement Learning</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Supervised Learning</td><td>Unsupervised Learning</td><td>Reinforcement Learning</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Data Dependency</strong></td><td>Requires labeled data for training.</td><td>Uses unlabeled data to identify patterns.</td><td>No predefined dataset; learns through interactions.</td></tr>
<tr>
<td><strong>Objective</strong></td><td>Learns to map input to output using labeled examples.</td><td>Identifies hidden structures and relationships in data.</td><td>Learns to maximize cumulative rewards through actions.</td></tr>
<tr>
<td><strong>Feedback Type</strong></td><td>Direct supervision through labeled examples.</td><td>No direct feedback, only pattern recognition.</td><td>Reward-based feedback guiding decision-making.</td></tr>
<tr>
<td><strong>Learning Process</strong></td><td>Trains using historical data and supervised loss functions.</td><td>Clusters or reduces dimensionality of data for better insights.</td><td>Explores and exploits different actions to optimize performance.</td></tr>
<tr>
<td><strong>Application</strong></td><td>Image classification, speech recognition, spam detection.</td><td>Customer segmentation, anomaly detection, recommendation systems.</td><td>Game playing, robotics, autonomous systems, financial trading.</td></tr>
<tr>
<td><strong>Decision Dependency</strong></td><td>Decisions are independent of past predictions.</td><td>Decisions are based on statistical structures.</td><td>Decisions depend on past actions and state transitions.</td></tr>
</tbody>
</table>
</div><h3 id="heading-additional-considerations">Additional Considerations</h3>
<p>While the above table highlights key differences, RL introduces additional complexities not present in supervised or unsupervised learning. One crucial aspect is <strong>exploration vs. exploitation</strong>, where an agent must balance between trying new actions (exploration) and leveraging known information (exploitation) to maximize rewards. RL is also distinct in its emphasis on <strong>long-term dependencies</strong>, as agents often need to strategize for future rewards rather than merely optimizing for immediate outcomes. Additionally, RL environments are <strong>dynamic</strong>, meaning that the agent's policy and the environment itself may evolve over time based on interactions.</p>
<p>Another unique characteristic of RL is its reliance on <strong>trial-and-error learning</strong>, where agents make decisions without prior knowledge and gradually refine their approach through experience. This is particularly useful in scenarios where optimal decision-making cannot be explicitly programmed but must be discovered through continuous learning.</p>
<p>Reinforcement learning (RL) takes a comprehensive approach by considering the entire goal-directed agent interacting with an uncertain environment. In contrast, many other learning approaches focus on isolated subproblems without addressing the bigger picture. While these alternative approaches have yielded valuable insights, their emphasis on fragmented issues presents a significant limitation when dealing with dynamic and evolving environments.</p>
<p>It is typically assumed that an RL agent must operate in the face of significant uncertainty. When RL involves planning, it must address the balance between strategic foresight and real-time decision-making, while also tackling how environmental models are developed and refined. Similarly, when RL incorporates supervised learning, it does so with a clear purpose—defining which capabilities are essential for the learning process and which are not. For learning research to advance meaningfully, subproblems must be examined in ways that align with the broader goal of creating complete, interactive, and goal-seeking agents.</p>
<p>As Sutton and Barto state in their book:</p>
<blockquote>
<p>It is not clear how far back the pendulum will swing, but reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence.</p>
</blockquote>
<h2 id="heading-elements-of-reinforcement-learning">Elements of Reinforcement Learning</h2>
<p>Reinforcement learning consists of four fundamental elements that define how an agent interacts with its environment and learns over time.</p>
<h3 id="heading-1-policy">1. Policy</h3>
<p>A <strong>policy</strong> is the strategy that an agent follows while making decisions. It maps the agent's perceived state of the environment to the actions it should take. Policies can be deterministic (choosing a specific action for a state) or stochastic (choosing actions based on probability distributions). The policy serves as the brain of the agent, determining its behavior at any given time.</p>
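<p>A tiny sketch can make the deterministic/stochastic distinction concrete. The states and actions here are made up for illustration (loosely inspired by the book's recycling-robot example):</p>

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def act(policy, state):
    choice = policy[state]
    if isinstance(choice, str):  # deterministic: return the mapped action
        return choice
    actions, probs = zip(*choice.items())  # stochastic: sample from the distribution
    return random.choices(actions, weights=probs)[0]
```

<p>Both policies map perceived states to actions; the stochastic one simply answers with a sample from a distribution rather than a fixed choice.</p>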
<h3 id="heading-2-reward-signal">2. Reward Signal</h3>
<p>The <strong>reward signal</strong> is the primary feedback mechanism in RL. Each action taken by the agent results in a reward, which is a numerical value that indicates how beneficial that action was. The objective of an RL agent is to maximize the cumulative reward over time. Rewards drive the learning process, reinforcing actions that lead to better outcomes and discouraging less optimal actions.</p>
<h3 id="heading-3-value-function">3. Value Function</h3>
<p>The <strong>value function</strong> estimates the long-term desirability of a given state. Unlike the reward signal, which provides immediate feedback, the value function helps the agent assess how beneficial a state is in the long run. It enables the agent to make more informed decisions by considering future rewards, rather than focusing solely on immediate gains.</p>
<h3 id="heading-4-model-of-the-environment-optional">4. Model of the Environment (Optional)</h3>
<p>Some RL methods use a <strong>model of the environment</strong> to predict state transitions and rewards before taking actions. These are called model-based methods. Conversely, model-free methods do not rely on such predictions and instead learn purely from trial and error. A model can enhance planning capabilities by simulating future scenarios and refining decision-making processes.</p>
<h3 id="heading-analogy-a-chefs-culinary-journey"><em>Analogy: A Chef’s Culinary Journey</em></h3>
<p>Imagine a chef who is learning to create the perfect dish. The <strong>policy</strong> represents the chef's approach to cooking—whether they follow a recipe strictly or experiment with different ingredients. The <strong>reward signal</strong> comes from the feedback they receive, either from tasting their dish or from customer reviews. Over time, the chef learns which combinations of flavors and techniques result in the best dishes—this is akin to developing a <strong>value function</strong>, where they refine their intuition for what works best. If the chef keeps detailed notes on ingredient substitutions and cooking methods, they are essentially building a <strong>model of the environment</strong>, helping them anticipate the outcome of their future creations.</p>
<h2 id="heading-summary-of-key-points">Summary of Key Points</h2>
<ul>
<li><p><strong>Reinforcement Learning (RL)</strong> is a distinct learning paradigm that relies on interaction with the environment rather than labeled data.</p>
</li>
<li><p><strong>Key difference from Supervised and Unsupervised Learning:</strong> RL focuses on decision-making through trial-and-error and reward signals.</p>
</li>
<li><p><strong>Core Elements of RL:</strong></p>
<ul>
<li><p><strong>Policy:</strong> The strategy an agent follows to decide actions.</p>
</li>
<li><p><strong>Reward Signal:</strong> Feedback mechanism guiding the learning process.</p>
</li>
<li><p><strong>Value Function:</strong> Estimates long-term benefits of being in a particular state.</p>
</li>
<li><p><strong>Model of the Environment (optional):</strong> Used for planning in model-based methods.</p>
</li>
</ul>
</li>
<li><p><strong>Exploration vs. Exploitation:</strong> Balancing between trying new actions and leveraging past knowledge.</p>
</li>
<li><p><strong>Delayed Rewards:</strong> Actions may not provide immediate benefits, making long-term planning crucial.</p>
</li>
<li><p><strong>Markov Decision Processes (MDPs):</strong> The mathematical framework behind RL.</p>
</li>
<li><p><strong>Trial-and-Error Learning:</strong> RL agents learn through continuous interaction rather than pre-labeled data.</p>
</li>
<li><p><strong>Dynamic Environments:</strong> RL methods adapt to changing conditions and optimize policies over time.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>