The Alignment Tax Is Not a Law of Nature
There’s a moment in every model family’s life when something flips. Before that moment, getting smarter makes the model less honest. After it, getting smarter makes it more honest. We found that moment, measured it, and showed you can move it.
The Trade-off Everyone Assumes
Ask anyone in AI safety: “Does scaling make alignment harder?” Most will say yes. It’s treated as a law of nature — bigger models, bigger problems.
We measured it. Across 63 base models in 16 families, we tracked how reasoning (HellaSwag) and truthfulness (TruthfulQA) relate as models get bigger. Not whether each improves — whether they help or hurt each other.
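The measurement itself is simple: one Pearson correlation per family, taken across the size ladder. Here's a minimal sketch; the scores below are invented placeholders, not our data.

```python
# Minimal sketch of the family-level measurement. The scores are
# invented placeholders for one model family, smallest model first.
import numpy as np

hellaswag  = np.array([0.28, 0.35, 0.48, 0.59, 0.66, 0.71])  # reasoning
truthfulqa = np.array([0.47, 0.44, 0.40, 0.37, 0.35, 0.34])  # truthfulness

# Pearson r across sizes: r < 0 means the capabilities fight as the
# family scales (the tax); r > 0 means they cooperate (the bonus).
r = np.corrcoef(hellaswag, truthfulqa)[0, 1]
print(f"cross-scale coupling: r = {r:+.3f}")
```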
The Phase Transition
Think of water freezing. Above 0°C, molecules move freely. Below, they lock into a crystal. The physics doesn't change gradually; it flips at a sharp boundary.
AI capabilities do the same thing. Below a critical scale, reasoning and truthfulness are anti-correlated (r = −0.989 in Pythia). Train the model to reason better, and it gets less truthful. This is the alignment tax. It’s real. Every web-trained family shows it.
But above that critical scale, the sign flips. Capabilities cooperate. Better reasoning = better truthfulness. No trade-off. The tax was a phase, not a law.
Tax Phase (below Nc): capabilities fight, γ₁₂ < 0.
Transition (at Nc): maximum leverage, γ₁₂ = 0.
Bonus Phase (above Nc): capabilities cooperate, γ₁₂ > 0.
Here γ₁₂ is the measured coupling between the two capabilities; its sign is what flips at Nc.
It’s Not One Number
The critical scale Nc isn’t a universal constant. It’s a design parameter. OPT hits it at 0.12B. Pythia at 3.5B. Falcon at 7B. That’s a 60× range.
Even more interesting: curated models like Phi and Qwen3 bypass the tax entirely. Their Nc is effectively below the smallest model tested. Data curation doesn't just improve quality; it moves the phase boundary.
Three levers shift Nc independently: data curation, model width, and architecture. Each is measurable. Each is actionable.
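If you want a rough Nc for your own family, a sliding-window correlation over the size ladder is enough to bracket it. In the sketch below, the scores, the four-size window, and the bracketing logic are all illustrative assumptions, not our fitting procedure.

```python
# Illustrative sketch: bracket Nc as the scale where the windowed
# HellaSwag/TruthfulQA correlation changes sign. All numbers invented.
import numpy as np

params_b = np.array([0.07, 0.16, 0.41, 1.0, 1.4, 2.8, 6.9, 12.0])  # size, B
hs  = np.array([0.27, 0.30, 0.35, 0.42, 0.46, 0.53, 0.61, 0.65])
tqa = np.array([0.46, 0.44, 0.42, 0.41, 0.405, 0.42, 0.45, 0.48])

w = 4  # window: four consecutive model sizes (an assumption)
r = np.array([np.corrcoef(hs[i:i + w], tqa[i:i + w])[0, 1]
              for i in range(len(hs) - w + 1)])
centers = params_b[w // 2 : w // 2 + len(r)]  # rough window centers

# First adjacent pair of windows where r goes negative -> positive
flip = int(np.flatnonzero((r[:-1] < 0) & (r[1:] > 0))[0])
print(f"Nc bracketed between {centers[flip]}B and {centers[flip + 1]}B")
```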
What This Looks Like at Frontier Scale
This is where Paper 3B picks up. At frontier scale — 39 models from 10 labs — we can’t measure HellaSwag vs TruthfulQA anymore (they’re saturated). But SWE-bench and GPQA Diamond are the new axes, and they cooperate too: r = +0.72.
The h-field is the key diagnostic. It’s simply how far each model deviates from the cooperation trend. Positive h = reasoning-rich. Negative h = coding-rich. One number tells you a model’s training philosophy.
In physics, the external magnetic field h breaks symmetry between spin-up and spin-down. In CAPE, the h-field is the external force — the training recipe — that pushes a model off the natural cooperation trend. A coding-heavy recipe pushes h negative (like a field favoring spin-down). A reasoning-heavy recipe pushes h positive. The field is external to the model’s intrinsic coupling — it’s what the lab chose, not what the architecture wants.
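Computing h really is a 30-second job. The sketch below is one way to do it, not necessarily ours: fit the cross-model cooperation trend with ordinary least squares, then read h as the signed residual. The scores are placeholders, and the straight OLS line is a simplifying assumption.

```python
# Hedged sketch of the h-field diagnostic. Scores are placeholder
# percentage points; the linear OLS trend is an assumption.
import numpy as np

swe  = np.array([30.0, 38.0, 45.0, 52.0, 60.0, 68.0])  # SWE-bench, pp
gpqa = np.array([48.0, 52.0, 57.0, 60.0, 66.0, 70.0])  # GPQA Diamond, pp

# Cooperation trend: gpqa ~ slope * swe + intercept
slope, intercept = np.polyfit(swe, gpqa, 1)

def h_field(swe_score, gpqa_score):
    """h > 0: reasoning-rich (above trend). h < 0: coding-rich (below)."""
    return gpqa_score - (slope * swe_score + intercept)

# A coding-heavy model: strong SWE-bench, GPQA below trend -> h << 0
print(f"h = {h_field(70.0, 58.0):+.1f}")
```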
Google consistently invests in reasoning (h stays positive across releases). Anthropic is coding-rich (h = −6.9 on average) — but this isn’t permanent. When Sonnet 4.6 went deep into a coding excursion (h = −13.1), Opus 4.6 recovered to h = +3.5 at the next release. Tax excursions are temporary. The same pattern shows up at OpenAI (GPT-5.4 dips, GPT-5.2 Pro recovers) and Google (Flash→Pro excursion then recovery).
Coding-specialist releases create local tax excursions that recover at the next generation. The universality of this pattern across Anthropic, OpenAI, and Google — each with different architectures, data, and training recipes — is the strongest evidence that the coupling dynamics are fundamental, not lab-specific.
The Cascade: It Keeps Repeating
Here’s what surprised us most. The transition doesn’t happen once. It repeats at every scale, with different benchmarks each time:
Nc1 (~0.1–7B): HS ↔ TQA coupling flips.
Nc2 (~30–72B): internal coupling crashes 59%; SWE ↔ GPQA activate.
Nc3 (~114B, predicted): SWE saturates; IFEval ↔ HLE activate.
Nc4 (~200B+, predicted): IFEval saturates; next axis TBD.
At each level, the old benchmarks lock together (they stop discriminating), new ones emerge, and the whole tax-transition-bonus cycle starts fresh. Think of a child learning to walk — at first, balance and speed fight each other (the tax). Then they click, and speed helps balance (the bonus). Then the child starts running, and a new trade-off appears between speed and agility. Each level of mastery creates a new coupling that has to be resolved at the next level.
We measured this directly in OPT’s internal coupling: it rises from 0.514 (125M) to 0.876 (13B), then crashes to 0.356 at 30B — the same pattern as Nc1, repeating at Nc2. Same math. Different scale. Like harmonics of a vibrating string.
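A rise-then-crash detector makes the echo concrete. In this sketch, only the three OPT values quoted above are real; the intermediate points and the 50% crash threshold are illustrative assumptions.

```python
# Flag an Nc2-style echo: internal coupling rises to a peak, then
# crashes by more than crash_frac. Only 0.514, 0.876, and 0.356 are
# reported values; the intermediates and threshold are assumptions.
def cascade_echo(coupling, crash_frac=0.5):
    """Index where coupling has fallen > crash_frac below its running peak."""
    peak = 0.0
    for i, c in enumerate(coupling):
        peak = max(peak, c)
        if peak > 0 and (peak - c) / peak > crash_frac:
            return i
    return None

opt = [0.514, 0.612, 0.700, 0.876, 0.356]  # 125M ... 30B
print(cascade_echo(opt))  # 4: the 30B crash (0.876 -> 0.356 is -59%)
```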
The Equation That Predicts Benchmarks
The coupling isn’t just a pattern we noticed. It’s governed by an ODE, a differential equation we discovered from data. Feed it one initial condition (Pythia-70M’s scores) and it predicts all 5 benchmarks across 8 model sizes. It then cross-predicts a held-out family (Llama-2) at 5.6% MAE. No physics was assumed; the equation emerged from the data. But it has the same form as equations that govern phase transitions in superconductors.
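We won’t reproduce the fitted equation here. The sketch below shows only the shape of the method, integrating benchmark trajectories in log-scale from one small-model initial condition; the coupled-logistic form, the growth rates a, and the coupling matrix G are all invented for illustration.

```python
# Illustration only: NOT the paper's fitted ODE. The coupled-logistic
# form, growth rates a, and coupling matrix G are invented to show the
# shape of the approach: integrate from one small-model condition.
import numpy as np
from scipy.integrate import solve_ivp

a = np.array([0.9, 0.6])           # per-benchmark growth rates (assumed)
G = np.array([[ 0.0, -0.4],        # G[i, j]: push of benchmark j on i;
              [-0.4,  0.0]])       # negative off-diagonals = tax phase

def rhs(log_n, b):
    # db_i/dlogN = a_i * b_i * (1 - b_i) + b_i * sum_j G_ij * b_j
    return a * b * (1.0 - b) + b * (G @ b)

b0 = [0.28, 0.45]  # scores of the smallest model, e.g. (HS, TQA)
sol = solve_ivp(rhs, (np.log(0.07), np.log(13.0)), b0)  # 70M -> 13B
print(sol.y[:, -1])  # predicted scores at the largest scale
```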
Engineering: You Can Skip the Tax
The most practical finding: the alignment tax can be eliminated. Phi at 1B achieves coupling characteristic of standard-trained 10B models. Qwen3 at 1.7B has 100% cooperative heads where Qwen2.5 at 1.5B had 97% competing. One generation of curation erased the tax entirely.
What To Do With This
This isn’t just measurement. It’s actionable:
If you’re training below Nc: Don’t just scale. Curate. One unit of data quality ≈ 10× model scale in coupling improvement.
If you’re at Nc: You’re at the critical point. Small interventions have maximum leverage. This is where alignment ROI is highest.
If you’re deploying a frontier model: Compute the h-field from two public benchmark scores. It takes 30 seconds and tells you your model’s training bias. If |h| > 5, your model is a specialist — plan accordingly.
If you’re evaluating models: Watch the saturation ratio. When the top-5 models compress to <2pp spread on a benchmark, that benchmark is done. The next axis is already activating.
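That last check is mechanical enough to automate. A hedged sketch, with invented scores and the <2pp rule from above:

```python
# Saturation check: a benchmark stops discriminating once the top-5
# models span less than 2 percentage points. Scores are invented.
import numpy as np

def is_saturated(scores, top_k=5, threshold_pp=2.0):
    """True if the top-k scores span < threshold_pp percentage points."""
    top = np.sort(np.asarray(scores))[-top_k:]
    return float(top.max() - top.min()) < threshold_pp

swe_bench = [61.2, 63.8, 64.5, 64.9, 65.3, 65.7, 66.1]
print(is_saturated(swe_bench))  # True: time to watch the next axis
```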
Seven Bets on the Table
We made seven falsifiable predictions with timestamped deadlines. If we’re wrong, the framework breaks publicly. Three are already confirmed:
1. OLMo at γ₁₂ = 0.000 exactly (confirmed independently by AI2)
2. ODE cross-predicts Llama-2 at 5.6% MAE (2.6× better than polynomial)
3. Qwen3 cooperative at all scales (curation eliminated the tax)
The four remaining predictions test frontier dynamics: SWE saturation by Dec 2026, IFEval activation, lab trajectory persistence, and the Nc4 cascade. The dashboard tracks these live.
Try It
The CAPE Dashboard lets you enter any model’s benchmarks and get its phase, coupling trajectory, h-field, and ODE prediction. The cape-steer CLI lets you run activation-level alignment correction on any open-weight model.
The alignment tax is not a law of nature. It is an engineerable bottleneck — a phase that every model family grows out of, and that good engineering can skip entirely.
Papers: “Lying Is Just a Phase” (Paper 3A) and “The Growing Pains of Frontier Models” (Paper 3B) — NeurIPS 2026. Code and data at github.com/adilamin89/cape-scaling.
Contact: [email protected]