Attention Is All You Need: What the Paper's Heads Are Actually Doing at Each Layer
Mar 8, 2026 · 12 min read
Mixtral of Experts: Top-2 Routing Gives 47B Capacity at 13B Active Compute
At roughly 2.08e11 cumulative FLOPs in my own run, a dense baseline lands at 25.31 validation perplexity. At nearly the same compute budget, a sparse MoE lands at 20.98. The absolute delta is 4.324, a...
Mar 22, 2026 · 12 min read
Exploration vs. Exploitation: A Deep Dive into Multi-armed Bandits
A k-armed Bandit Problem: Imagine you're at a casino, faced with a row of slot machines (one-armed bandits), each with its own hidden probability of paying out. Your goal is to maximize your winnings over the night, but you don't know which machines h...
Feb 19, 2025 · 12 min read