Jon Moshier / Notes / Mixture of Experts draft

Mixture of Experts

How splitting a model into specialized sub-networks lets you scale parameters without scaling compute.

A sparse-activation architecture in which a learned router sends each token to a small subset of specialized sub-networks, so a trillion-parameter model can run at roughly the per-token compute cost of a much smaller dense one.

What it is

A Mixture of Experts (MoE) model replaces some or all of the feed-forward layers in a neural network (typically a Transformer) with N independent expert networks plus a lightweight router (also called a gating network). For each input token, the router scores all experts and picks the top-k (usually k=1 or k=2) to process that token. All other experts stay idle.

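The payoff: with, say, 8 experts and top-1 routing, the MoE layers hold about 8× the parameters of their dense equivalents while using roughly the same FLOPs per token, because only a fraction of the weights activate for any given token. The catch is memory: every expert still has to be stored, even though most are idle for any one token.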
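The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope count, assuming a plain two-matrix FFN per expert with biases ignored; the layer sizes and the function names are made up for illustration.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one linear layer that scores every expert
    total = num_experts * expert + router
    active = top_k * expert + router
    return total, active


# Hypothetical sizes, chosen only for illustration.
dense = ffn_params(1024, 4096)
total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=8, top_k=1)

print(f"total / dense  = {total / dense:.2f}")   # 8.00
print(f"active / dense = {active / dense:.3f}")  # 1.001
```

The router adds a negligible number of weights, which is why the per-token cost stays close to that of the dense layer.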

The three moving parts:

  1. Experts: typically FFN blocks with identical architecture but separately learned weights.
  2. Router: a small linear layer plus softmax that outputs a probability distribution over experts, from which the top-k are selected.
  3. Load-balancing loss: an auxiliary training term that penalizes sending most traffic to a few hot experts. Without it, routing tends to collapse onto those experts, so the model behaves like a much smaller one and the remaining experts' capacity is wasted.
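The router and the balancing term can be sketched in dependency-free Python. This is a minimal illustration, not any particular library's implementation: it assumes the Switch-style auxiliary loss N · Σ f_i · P_i (f_i is the share of routing slots that went to expert i, P_i the mean router probability for expert i), extended to top-k by counting k slots per token. The function names are my own.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route(token_logits, k):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(token_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return probs, {i: probs[i] / norm for i in top}


def load_balancing_loss(all_logits, k):
    """Switch-style auxiliary loss N * sum_i f_i * P_i over a batch of tokens."""
    n_experts = len(all_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in all_logits:
        probs, gates = route(logits, k)
        for i in gates:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    n_tokens = len(all_logits)
    f = [c / (n_tokens * k) for c in counts]   # share of routing slots per expert
    P = [s / n_tokens for s in prob_sums]      # mean router probability per expert
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))


# Four tokens, four experts: each token prefers a different expert...
balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
# ...versus every token preferring expert 0.
collapsed = [[5, 0, 0, 0]] * 4

print(load_balancing_loss(balanced, k=1))   # about 1.0, the balanced minimum
print(load_balancing_loss(collapsed, k=1))  # about 3.92
```

In training, this value would be scaled by a small coefficient and added to the task loss, so the router is nudged toward even usage without overriding the main objective.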

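The original insight dates to Jacobs et al. (1991), but modern MoE took off when Shazeer et al. (2017) showed that a sparsely gated MoE layer could scale an LSTM language model to 137B parameters while staying within a practical compute budget. Later work such as GShard and Switch Transformers carried the idea into Transformers.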


Pros / Cons

Pros

Cons


Key Papers

| Paper | What it contributes |
| --- | --- |
| Jacobs et al. 1991, “Adaptive Mixtures of Local Experts” | The original MoE concept: a gating network trained jointly with experts that compete to explain each example. |
| Shazeer et al. 2017, “Outrageously Large Neural Networks” | Sparsely-gated MoE layer inside an LSTM; scales to 137B parameters with top-k routing and noisy gating. |
| Lepikhin et al. 2021, “GShard” | MoE in every other Transformer layer; scales to 600B parameters across TPU pods. |
| Fedus et al. 2022, “Switch Transformers” | Simplifies routing to top-1; scales to 1.6T parameters; formalizes expert capacity through a capacity factor. |
| Zoph et al. 2022, “ST-MoE” | Design rules for stable, fine-tunable sparse models; router z-loss. |
| Jiang et al. 2024, “Mixtral of Experts” | Open-weight 8x7B model (an 8x22B model followed as a separate release); top-2 routing with 2 of 8 experts active per token. |
| Dai et al. 2024, “DeepSeekMoE” | Fine-grained experts (many small experts, not few large ones) plus shared “always-on” experts; efficient specialization. |
| DeepSeek-AI 2024, “DeepSeek-V3” | 671B total / 37B active parameters per token; auxiliary-loss-free load balancing; reported state-of-the-art results among open models at release. |
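The capacity factor mentioned for Switch Transformers is easy to make concrete: each expert gets a fixed number of slots, (tokens per batch / number of experts) × capacity factor, and tokens that arrive after an expert is full are dropped from that layer. The sketch below assumes top-1 routing, first-come-first-served filling, and rounding up; those details and the function names are illustrative choices, not taken from the paper's code.

```python
import math


def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    """Slots per expert: an even share of the batch, scaled by the capacity factor."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """assignments[i] is the expert chosen for token i (top-1).

    Returns (kept, dropped) token indices; a token is dropped when its
    expert has already filled all of its slots.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for tok, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(tok)
        else:
            dropped.append(tok)
    return kept, dropped


capacity = expert_capacity(tokens_per_batch=8, num_experts=4, capacity_factor=1.25)
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 1, 2, 3, 1], num_experts=4, capacity=capacity)
print(capacity)  # 3
print(dropped)   # [3]  (the fourth token sent to expert 0 overflows)
```

A larger capacity factor drops fewer tokens but reserves more padded compute per expert, which is the trade-off the factor exists to tune.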

Projects Using MoE


Next Steps / How to Go Deeper

1. Run a real MoE in an afternoon

pip install transformers torch accelerate bitsandbytes

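Load Mixtral 8x7B in 4-bit quantization (the model has about 47B parameters in total, so expect roughly 24 GB or more of VRAM for the weights alone; alternatively, run on CPU with patience):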

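from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spread layers across the available devices
    # 4-bit weights via bitsandbytes; passing load_in_4bit=True directly is deprecated
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)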

Then hook into the router logits to see which experts fire for a given prompt — this is the fastest way to build intuition.
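One way to do that is sketched below. It assumes a Hugging Face Mixtral-style model whose forward pass accepts output_router_logits=True and returns a router_logits tuple with one tensor per MoE layer; the helper names are my own, and the exact tensor shape may differ between library versions, so the code flattens it defensively.

```python
def top_k_experts(router_logits, k=2):
    """Indices of the k highest-scoring experts for each token (one row per token)."""
    return [
        sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        for row in router_logits
    ]


def experts_per_layer(model, tokenizer, prompt, k=2):
    """Run one forward pass and return, per MoE layer, the top-k experts per token.

    Assumes `model` and `tokenizer` were loaded with transformers as above.
    """
    import torch  # imported lazily so top_k_experts works without torch installed

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_router_logits=True)

    choices = []
    for layer_logits in out.router_logits:
        # Flatten to (tokens, num_experts) whether or not a batch dimension is kept.
        flat = layer_logits.reshape(-1, layer_logits.shape[-1]).float().tolist()
        choices.append(top_k_experts(flat, k))
    return choices


# Tiny check on made-up logits for one token and four experts.
print(top_k_experts([[0.1, 2.0, -1.0, 0.5]], k=2))  # [[1, 3]]
```

Calling experts_per_layer(model, tokenizer, "def add(a, b):") would then give, for each layer, a list of expert pairs aligned with the prompt's tokens.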

2. Visualize expert routing

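The MoE-I2 visualization toolkit and HuggingFace’s output_router_logits=True flag both expose routing information per token. With the flag, the model returns each MoE layer's router logits, and the top-k entries per token give the expert assignments. Feed the model contrasting inputs (code vs. poetry vs. math) and check whether the same experts activate.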
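Before reaching for plots, a numeric comparison is often enough. The sketch below assumes you already have, for one layer, a list of chosen expert indices per token for each prompt; it turns them into usage histograms and measures their overlap. The histogram-intersection score and the toy assignments are illustrative choices, not part of any toolkit.

```python
from collections import Counter


def expert_usage(choices, num_experts):
    """Fraction of routing slots that went to each expert.

    choices: one list of expert indices per token, e.g. [[0, 1], [0, 2]].
    """
    counts = Counter(e for token in choices for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def usage_overlap(usage_a, usage_b):
    """Histogram intersection in [0, 1]; 1.0 means identical expert usage."""
    return sum(min(a, b) for a, b in zip(usage_a, usage_b))


# Made-up top-2 assignments for three tokens of each prompt type.
code_choices = [[0, 1], [0, 2], [0, 1]]
poem_choices = [[3, 2], [3, 1], [3, 2]]

code_usage = expert_usage(code_choices, num_experts=4)
poem_usage = expert_usage(poem_choices, num_experts=4)
print(round(usage_overlap(code_usage, poem_usage), 3))  # 0.333
```

An overlap near 1.0 across very different inputs would suggest the layer routes on something other than topic, such as token position or syntax.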

3. Build a tiny MoE from scratch

Phil Wang’s x-transformers has a clean MoE implementation (~200 lines) that’s good for reading and modifying. Swap in the MoE layer, train on a small task, and log the routing entropy to watch specialization emerge.
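"Routing entropy" can mean two different things, so the sketch below computes both under my own naming: the entropy of the router's softmax for a single token (low when the router is confident) and the normalized entropy of aggregate expert usage over a batch (low when traffic collapses onto few experts). Which one to log is a design choice; tracking both is cheap.

```python
import math


def token_entropy(probs):
    """Entropy (in nats) of one token's router distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def usage_entropy(expert_counts):
    """Normalized entropy of how often each expert was chosen over a batch.

    Returns 1.0 for perfectly even usage and 0.0 when one expert takes everything.
    """
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(expert_counts))


print(token_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4), an undecided router
print(usage_entropy([10, 10, 10, 10]))          # even usage
print(usage_entropy([40, 0, 0, 0]))             # collapsed routing
```

A healthy run would typically show per-token entropy falling as the router becomes decisive while usage entropy stays high; both falling together is the signature of collapse.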

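4. Read up on the routing-collapse problem

The load-balancing loss described earlier is the standard remedy for collapse itself. For the systems side of uneven routing, the MegaBlocks paper (Gale et al. 2022) reformulates expert computation as block-sparse matrix operations, so unevenly loaded experts can be handled without dropping tokens or padding every expert to a fixed capacity. That makes training faster, and the approach underlies some widely used MoE training code.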
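The bookkeeping behind the block-sparse idea can be illustrated without any GPU code. The sketch below is not the MegaBlocks kernels; it only shows, under the assumption that each expert's group of tokens is padded up to a multiple of a block size, how variable-sized groups keep every token while computing fewer padded slots than a fixed per-expert capacity would. The function names and the toy numbers are hypothetical.

```python
def group_tokens_by_expert(assignments, num_experts):
    """Collect token indices per expert; groups may have different sizes."""
    groups = [[] for _ in range(num_experts)]
    for tok, expert in enumerate(assignments):
        groups[expert].append(tok)
    return groups


def padded_slots(groups, block_size):
    """Slots computed when each group is rounded up to a multiple of block_size."""
    return sum(-(-len(g) // block_size) * block_size for g in groups)


# Eight tokens, four experts, with expert 0 heavily loaded.
assignments = [0, 0, 0, 0, 0, 1, 2, 2]
groups = group_tokens_by_expert(assignments, num_experts=4)

print(groups)                              # [[0, 1, 2, 3, 4], [5], [6, 7], []]
print(padded_slots(groups, block_size=2))  # 10 slots, no token dropped

# For comparison, a fixed capacity of 3 slots per expert always computes
# 4 * 3 = 12 slots and would still drop two of expert 0's five tokens.
```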

5. Key question to chase

Do experts specialize semantically, or is it arbitrary symmetry-breaking? Answering this empirically (even on a tiny model) connects MoE to the broader question of interpretability — and links naturally to Neural Networks and Brain Scaling.
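One simple way to put a number on that question is the mutual information between a token's domain label and the expert it was routed to. The sketch below assumes you have paired lists of labels and top-1 expert indices; the metric choice, the function name, and the toy data are mine. Zero means routing ignores the label, and higher values mean experts track the domains.

```python
import math
from collections import Counter


def mutual_information(labels, experts):
    """Mutual information (in nats) between domain labels and expert choices."""
    n = len(labels)
    joint = Counter(zip(labels, experts))
    label_counts = Counter(labels)
    expert_counts = Counter(experts)
    mi = 0.0
    for (label, expert), count in joint.items():
        p_joint = count / n
        p_label = label_counts[label] / n
        p_expert = expert_counts[expert] / n
        mi += p_joint * math.log(p_joint / (p_label * p_expert))
    return mi


labels = ["code", "code", "math", "math"]

# Experts line up with domains: MI equals log(2).
print(mutual_information(labels, [0, 0, 1, 1]))

# Experts are spread evenly across domains: MI is 0.
print(mutual_information(labels, [0, 1, 0, 1]))
```

A nonzero score on real data is only a starting point: experts could be keying on surface features that happen to correlate with domain, so it is worth repeating the measurement with labels for syntax or token type.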
