The Hidden Alignment Transition in Language Model Scaling
Enter your model's size + any benchmarks to get alignment phase, scaling recommendations, and predictions. Works for any model from 70M to frontier scale. · Amin (2026) · ZEHEN Labs
Critical Scale Nc: 3.5B · Pre-transition r: −0.989 · Base + Frontier Models: 63 + 39 · Families: 16 · Frontier Slope: 0.513
How to use this dashboard
Paper 3A (“Lying Is Just a Phase”): enter your model’s size and benchmarks in the Analyze tab to see its phase (tax/transition/bonus), coupling γ₁₂, and what to do. Works for 63 base models across 16 families. Nc varies by family (0.12B–7B).
Paper 3B (“Growing Pains of Frontier Models”): use the h-field tab for frontier models. Enter SWE-bench + GPQA scores to compute the h-field diagnostic — it tells you where your model sits relative to the population trend and what each lab’s training philosophy looks like. 39 models, 10 labs.
Analyze Any Model — Phase Classification + Actionable Recommendations
TAX
Below Nc — Alignment Tax
γ₁₂ < −0.1 · Capabilities anticorrelate: scaling reasoning hurts truthfulness. Nc varies by family (0.12B–7B, a 60× range). Curated families (Phi, Qwen3) bypass the tax entirely. Loss is blind (CV = 0.8%): the transition lives in the coupling, not the loss.
Curate data: 1 unit of quality ≈ 10× scale · Phi shows the tax is eliminable
TRANS
~Nc — Critical Point
−0.1 ≤ γ₁₂ ≤ +0.1 (7 models in transition) · Arrhenius C spikes 10×
Maximum susceptibility. Gradient dips 37% below trend. Eigenvector rotates sharply. Small interventions (curation, width, architecture) have maximum leverage. OLMo-1B sits here with γ₁₂ = 0.000 (independent confirmation by AI2).
Max alignment ROI · OLMo confirms: zero-parameter prediction
BONUS
Above Nc — Alignment Bonus
γ₁₂ > +0.1 (local) · r > +0.78 (population, large models) · deff = 1.22
Capabilities cooperate. Scaling helps both reasoning and truthfulness. Arrhenius activation energy C = 196 (vs 28 in Tax). deff decreases — capability manifold condenses. Three levers independently shift Nc: width, data curation, architecture.
OPT internal coupling: 0.876 (13B peak) → 0.356 (30B crash) → 0.396 (66B recovery). Same rise→peak→drop→recovery pattern as Nc1. Three distinct mechanisms converge: output bottleneck (OPT), flat weakening (Llama/Qwen), reversed profile (OLMo-2). Paper 3B documents this across 6 open-weight architectures.
SWE×GPQA now discriminating · Nc3 ≈ 114B (predicted)
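To make the phase logic concrete, here is a minimal classifier sketch. It assumes the Pythia-calibrated coupling fit γ₁₂(N) = 0.629·log₁₀N − 5.886 (the fit used in the ODE Explorer below) and the ±0.1 transition band from the cards above; other families need their own fit, since Nc varies 60× across families.

```python
import math

# Pythia-calibrated running coupling (Paper 3A). Other families need their own fit.
A, B = 0.629, -5.886

def gamma12(n_params: float) -> float:
    """Running coupling at model size n_params (number of parameters)."""
    return A * math.log10(n_params) + B

def phase(n_params: float) -> str:
    """Classify phase using the dashboard's +/-0.1 transition band."""
    g = gamma12(n_params)
    if g < -0.1:
        return "tax"         # capabilities anticorrelate
    if g > 0.1:
        return "bonus"       # capabilities cooperate
    return "transition"      # maximum susceptibility: interventions have max leverage

for n in [70e6, 410e6, 1e9, 3.5e9, 12e9]:
    print(f"{n:.2e} params: gamma12 = {gamma12(n):+.3f} -> {phase(n)}")
```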
Frontier Coupling — SWE-bench vs GPQA Diamond (Feb–Mar 2026)
r = +0.72 (n=34, p<10⁻⁶) full panel; r = +0.85 (n=21, core verified SWE↔GPQA). Cooperative coupling confirmed. Sonnet 4.6: h = −13.1 (tax excursion). Opus 4.6: h = +3.5 (recovery). GPT-5.4: h = −1.8 (mild coding-specialist).
Within-Family Trajectory — Anthropic as Phase Diagnostic
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| Sonnet 4.5 → Sonnet 4.6 | +2.4 | −9.3 | −3.88 | −13.4 | Tax excursion: coding optimized at reasoning cost |
| Sonnet 4.6 → Opus 4.6 | +1.2 | +17.2 | +14.3 | +2.8 | Recovery: full cooperative phase restored |
Protocol: For any two consecutive releases, compute γ₁₂ = ΔGPQA/ΔSWE. If negative: training recipe entered a tax excursion. Single eval run suffices to detect it before deployment.
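The protocol is a single division; a minimal sketch (the absolute scores below are illustrative placeholders whose deltas match the Sonnet 4.5 → 4.6 row above):

```python
def release_gamma12(swe_prev: float, gpqa_prev: float,
                    swe_next: float, gpqa_next: float) -> float:
    """Coupling between consecutive releases: gamma_12 = dGPQA / dSWE."""
    d_swe, d_gpqa = swe_next - swe_prev, gpqa_next - gpqa_prev
    if d_swe == 0:
        raise ValueError("SWE-bench unchanged; coupling undefined")
    return d_gpqa / d_swe

g = release_gamma12(77.2, 65.0, 79.6, 55.7)   # placeholder scores, deltas +2.4 / -9.3
print(f"gamma12 = {g:+.2f} ->", "tax excursion" if g < 0 else "cooperative")
```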
Within-Family Trajectory — Google Gemini as Independent Test
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| 2.5 Pro → 3 Flash | +14.2 | +6.4 | +0.45 | +8.9 → +4.1 | Cooperative: both improve |
| 3 Flash → 3 Pro | −1.8 | +1.5 | −0.83 | +4.1 → +7.0 | Flash→Pro tradeoff: reasoning prioritized over coding |
| 3 Pro → 3.1 Pro | +4.4 | +2.4 | +0.55 | +7.0 → +6.0 | Recovery: both capabilities improve |
Second within-family test: Gemini's h-field stays positive throughout (+4 to +9) — a reasoning-specialist training recipe, the frontier analogue of Phi.
The Flash→Pro excursion (γ₁₂ = −0.83) mirrors Anthropic's Sonnet→Opus pattern: tier-specialist training creates a local tax that recovers at the next release.
Two labs, same physics.
OpenAI Trajectory — Now With Tax Excursion (GPT-5.4)
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| GPT-4o → GPT-5 | +41.7 | +32.1 | +0.77 | +2.5 → +1.7 | Strongly cooperative: massive joint gain |
| GPT-5 → GPT-5.1 | +1.4 | +2.4 | +1.71 | +1.7 → +3.0 | Cooperative: reasoning outpaces coding |
| GPT-5.1 → GPT-5.4 | +0.9 | −3.9 | −4.33 | +3.0 → −1.6 | Tax excursion: coding optimized at reasoning cost |
| GPT-5.4 → GPT-5.2 Pro | +2.8 | +9.0 | +3.21 | −1.6 → +5.2 | Recovery: full cooperative phase restored |
Update: GPT-5.4 shows the same tax excursion pattern as Anthropic's Sonnet 4.6 (γ₁₂ = −4.33 vs −3.88). h dips to −1.6 before GPT-5.2 Pro recovers to +5.2.
Three labs, same physics: coding-specialist releases create local tax excursions that recover at the next generation. The universality of this pattern across Anthropic, OpenAI, and Google is now confirmed.
Frontier 3×3 Coupling Matrix — SWE · GPQA · IFEval
det(H_2×2) → 0. Third eigenvalue becomes significant. Pairwise γ₁₂ insufficient: need 3×3 coupling matrix. Future work extends to higher dimensions.
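A sketch of the proposed 3×3 extension. The score table below is made-up placeholder data standing in for real SWE-bench / GPQA / IFEval numbers:

```python
import numpy as np

# Rows = models, columns = (SWE-bench, GPQA, IFEval). Placeholder values only.
scores = np.array([
    [65.0, 70.0, 88.0],
    [72.0, 78.0, 90.0],
    [77.0, 83.0, 91.5],
    [80.0, 80.0, 93.0],
    [82.0, 86.0, 92.5],
])

H = np.corrcoef(scores, rowvar=False)       # 3x3 coupling (correlation) matrix
eigvals = np.linalg.eigvalsh(H)             # ascending eigenvalues
print("det(H) =", round(float(np.linalg.det(H)), 4))   # -> 0 signals a soft direction
print("third eigenvalue =", round(float(eigvals[0]), 4))
```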
Arrhenius Activation Energy per Phase — New Result
The Arrhenius form log(rate) = A − C/S was fit separately in each coupling phase. The activation constant C is not universal — it spikes 10× at the phase boundary. This is the thermodynamic signature of the saddle point.
| Phase | Scale Range | C_Arrhenius | r² | Interpretation |
|---|---|---|---|---|
| Tax | 70M–1B | 28 | 0.32 | Shallow activation barrier |
| Transition | 1B–2.8B | 316 ★ | 0.88 | 10× spike = saddle point of loss landscape |
| Bonus | 2.8B–12B | 196 | 0.94 | Deeper cooperative well |
log(dS/dlog₁₀N) = A − C/S
Arrhenius structure survives all three phases. The 10× C_Arr spike at Nc directly explains the 37% gradient dip — measurable from gradient norms without any benchmark data.
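A sketch of the per-phase Arrhenius fit. It assumes S is the benchmark score and the rate dS/dlog₁₀N has already been estimated by finite differences; the (S, rate) pairs below are synthetic:

```python
import numpy as np

S    = np.array([30.0, 35.0, 40.0, 45.0, 50.0])   # benchmark scores (synthetic)
rate = np.array([4.2, 5.1, 5.8, 6.3, 6.7])        # dS/dlog10(N) (synthetic)

# Arrhenius form log(rate) = A - C/S is linear in 1/S.
x, y = 1.0 / S, np.log(rate)
slope, A = np.polyfit(x, y, 1)                    # slope = -C
C = -slope
r2 = 1 - np.var(y - (A + slope * x)) / np.var(y)
print(f"A = {A:.2f}, C_Arrhenius = {C:.1f}, r^2 = {r2:.2f}")
```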
Benchmark Survival at Each Nc — Eigenvector Analysis
| Scale | Active Phase | Discriminating Benchmarks | New Dimension Trigger |
|---|---|---|---|
| 70M–3.5B | Tax | HellaSwag, TruthfulQA | — |
| ~3.5B | Nc1 | HS⊕TQA coupling flips | MMLU enters below chance at ~3B |
| 3.5B–70B | Bonus | HS, TQA, MMLU all cooperative | — |
| ~70B–130B | Frontier | SWE-bench, GPQA Diamond | IFEval λ₁ loading = 0.64 (dominant) |
| ~114B | Nc3 | IFEval + agentic safety | HarmBench / AgentBench (recommended) |
Phase-Separated Correlation Matrix — How TQA Restructures at Nc
▸ BELOW Nc (TAX PHASE)
|  | HS | TQA | ARC | MMLU | WG |
|---|---|---|---|---|---|
| HS | 1.00 | −0.53 | +0.89 | +0.74 | +0.67 |
| TQA | −0.53 | 1.00 | −0.65 | −0.12 | −0.28 |
| ARC | +0.89 | −0.65 | 1.00 | +0.82 | +0.71 |
| MMLU | +0.74 | −0.12 | +0.82 | 1.00 | +0.52 |
| WG | +0.67 | −0.28 | +0.71 | +0.52 | 1.00 |
4/10 pairs negative • deff = 1.53 • Mean r = +0.07
▸ ABOVE Nc (BONUS PHASE)
|  | HS | TQA | ARC | MMLU | WG |
|---|---|---|---|---|---|
| HS | 1.00 | +0.91 | +0.95 | +0.90 | +0.73 |
| TQA | +0.91 | 1.00 | +0.92 | +0.85 | +0.69 |
| ARC | +0.95 | +0.92 | 1.00 | +0.93 | +0.72 |
| MMLU | +0.90 | +0.85 | +0.93 | 1.00 | +0.62 |
| WG | +0.73 | +0.69 | +0.72 | +0.62 | 1.00 |
0/10 pairs negative • deff = 1.20 • Mean r = +0.89
Key finding: The restructuring is specific to truthfulness. All 4 TQA pairs flip sign across Nc (Frobenius |Δr| = 1.56); none of the 6 non-TQA pairs flips (|Δr| = 0.33). TQA loads anti-aligned with PC1 below Nc (+0.49, vs −0.49 for HS) and aligned above.
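For reference, a sketch of computing deff from a displayed matrix. Which normalization the papers use is an assumption here (two common choices are shown), and the dashboard computes from unrounded data, so values may differ slightly:

```python
import numpy as np

R_tax = np.array([                 # below-Nc matrix from the table above
    [ 1.00, -0.53,  0.89,  0.74,  0.67],
    [-0.53,  1.00, -0.65, -0.12, -0.28],
    [ 0.89, -0.65,  1.00,  0.82,  0.71],
    [ 0.74, -0.12,  0.82,  1.00,  0.52],
    [ 0.67, -0.28,  0.71,  0.52,  1.00],
])

lam = np.linalg.eigvalsh(R_tax)                  # eigen-spectrum
d_part = lam.sum() ** 2 / (lam ** 2).sum()       # participation ratio
d_top  = lam.sum() / lam.max()                   # trace / leading eigenvalue
print(f"deff (participation) = {d_part:.2f}, deff (trace/lambda1) = {d_top:.2f}")
```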
Phase-by-Phase Progression — deff Peaks at Transition (Critical Fluctuations)
| Phase | deff | Notes |
|---|---|---|
| Tax | 1.53 | 4 negative pairs · TQA anti-aligned |
| Transition | 1.81 | PEAK · maximum fluctuations at Nc |
| Bonus | 1.20 | 0 negative pairs · all cooperative |
| Frontier | 1.15 | deep cooperative regime |
| Nc,3 regime | 1.33 | all pairs positive but deff rising: a new tax opening? |
Physics prediction confirmed: deff peaks at 1.81 in the transition zone — maximum effective dimensionality at the critical point.
This is textbook: maximum fluctuations = maximum uncertainty about which phase the system occupies. The system "doesn't know" if it's in the tax or bonus regime, so all dimensions contribute equally.
Above Nc, deff collapses to ~1.2 as the soft mode freezes out. At Nc,3, deff starts rising again (1.33) — the fingerprint of a new transition opening.
Leave-One-Family-Out CV — Sign Robustness Across All 10 Benchmark Pairs
▸ BELOW Nc: 4/4 TQA pairs survive CV
- HS–TQA: negative in 5/5 folds
- ARC–TQA: negative in 5/5 folds
- MMLU–TQA: negative in 5/5 folds
- WG–TQA: negative in 4/5 folds
- All non-TQA pairs: positive in 5/5 folds
▸ ABOVE Nc: 10/10 pairs positive in all folds
Above Nc, every benchmark pair — including all TQA pairs — shows positive correlation in every leave-one-family-out fold. Net result: 4/4 TQA pairs flip sign across Nc; 0/6 non-TQA pairs flip.
The truthfulness tax is specific and robust.
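A sketch of the fold logic, assuming a hypothetical DataFrame `df` with one row per model, a `family` column, and one column per benchmark:

```python
import itertools
import numpy as np
import pandas as pd

def lofo_sign_consistency(df: pd.DataFrame, benchmarks: list) -> dict:
    """For each benchmark pair, count leave-one-family-out folds in which the
    cross-model correlation keeps the sign of the full-sample correlation."""
    result = {}
    for a, b in itertools.combinations(benchmarks, 2):
        full_sign = np.sign(df[a].corr(df[b]))
        folds = [df[df["family"] != fam] for fam in df["family"].unique()]
        agree = sum(np.sign(f[a].corr(f[b])) == full_sign for f in folds)
        result[(a, b)] = f"{agree}/{len(folds)} folds"
    return result
```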
RG Flow (Preliminary) — Beta Function and Fixed Point
Beta function: β(γ) = −1.35γ² − 0.27γ + 0.73 (R² = 0.58, quadratic fit to the running coupling)
Fixed point: γ* = 0.64 (stable: models converge to moderate cooperation)
Universality class: 1D random-field XY (νeff = 0.72, between mean-field and 3D Ising)
Asymptotic cooperation: Unlike QCD's asymptotic freedom (coupling weakens at high energy), AI capability coupling strengthens with scale — then saturates at γ* ≈ 0.64.
Large models converge toward moderate cooperative coupling, not runaway alignment. Full treatment deferred to Future work.
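The fixed point is just the stable root of the quadratic; a quick check:

```python
import numpy as np

coeffs = [-1.35, -0.27, 0.73]          # beta(gamma) from the fit above
beta_prime = np.polyder(np.poly1d(coeffs))

for g in np.roots(coeffs):
    kind = "stable" if beta_prime(g) < 0 else "unstable"
    print(f"gamma* = {g:+.3f} ({kind})")
# The stable root lands at gamma* ~ +0.64; the other root (~ -0.84) is unstable.
```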
Paper Reference — Phase diagram, cascade, internal evidence, equations
New: activation energy spikes 10× at N_c. Phase boundary = saddle point of loss landscape. Measurable from gradient norms alone.
CM ↔ AI Lever Mapping — Every Physics Lever Has an AI Analogue
The same intervention types that tune superconductors tune AI models. Click any row to expand.
| Physics Lever | CM Effect | AI Analogue | AI Effect |
|---|---|---|---|
| Pressure | Compress lattice, shift bands | Model size N | Compress/expand representation |
| B-field (c-axis) | Orbital limiting, vortices | h-field (recipe emphasis) | Capability emphasis shift |
| B-field (ab-plane) | Pauli limiting, spin effects | Different benchmark pair | Different coupling direction |
| Doping | Carrier density, move E_F | Data curation | Training distribution change |
| Temperature | Thermal fluctuations | Learning rate / noise | Training fluctuations |
| Strain | Lattice distortion | Architecture (width/depth) | Structural change at fixed N |
| Non-magnetic impurities | Anderson theorem: SC preserved | Dropout / augmentation | Robustness preserved |
| Magnetic impurities | Pair-breaking | Data contamination | Coupling destroyed |
| Twist angle (moiré) | Flat bands at magic angle | MoE routing / PLE | Effective coupling at routing |
| SOC (spin-orbit coupling) | Mixes spin channels | Cross-head attention | Mixes capability representations |
Note: This mapping is interpretive context — the formal analogy is quantitative in Papers 3A/3B for AI scaling. Cross-domain extensions to quantum materials are in preparation.
Polynomial Baseline — CAPE vs Naive Fits on Llama-2 Holdout
| Fit | Held-out MAE | Parameters | vs CAPE ODE |
|---|---|---|---|
| CAPE ODE | 5.6% | 4 | baseline |
| Degree-1 polynomial | 14.6% | 2 | 2.6× worse |
| Degree-2 polynomial | 10.2% | 3 | 1.8× worse |
| Degree-3 polynomial | 10.5% | 4 | 1.9× worse |
| Degree-4 polynomial | 10.4% | 5 | 1.9× worse |
Key result: The CAPE ODE with 4 parameters beats polynomials with up to 5 parameters by ~2×. Polynomials fail catastrophically at Llama-2 7B and 13B (12-16% error) because they can't represent the phase structure — they fit a smooth curve through a regime change.
The ODE succeeds because it encodes the coupling between benchmarks, not just individual trajectories. A polynomial can't know that TQA anticorrelates with HS below Nc.
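A sketch of the baseline protocol (fit polynomials on training-family points, score a held-out family); all trajectory data below is synthetic placeholder:

```python
import numpy as np

x_train = np.log10([70e6, 160e6, 410e6, 1e9, 2.8e9, 6.9e9, 12e9])
y_train = np.array([28.0, 31.5, 36.0, 39.0, 40.5, 47.0, 52.0])   # synthetic scores
x_test  = np.log10([7e9, 13e9])                                  # holdout sizes
y_test  = np.array([48.0, 53.5])                                 # synthetic holdout

for k in range(1, 5):
    poly = np.poly1d(np.polyfit(x_train, y_train, k))
    mae = np.mean(np.abs(poly(x_test) - y_test) / y_test) * 100
    print(f"degree-{k}: held-out MAE = {mae:.1f}%")
```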
Topology — Winding Number W = 0.5 (Fractional) + Kink Soliton
▸ HALF-INTEGER WINDING
Winding number: W = 0.5 (half-integer → Z₂ topology)
Geometric phase: −32.6° = −0.181π (not quantized)
The eigenvector e₂ crosses zero once at ~1.2B. One zero crossing = half-winding = Z₂ (Ising) topology, not U(1). The transition is binary: flip or don't flip. Supports domain walls between flipped/unflipped families, not continuous vortices.
In condensed matter: half-quantum vortices in p-wave SC (Sr₂RuO₄), half-vortices in spinor BEC. The CAPE analogue: each training generation crossing Nc undergoes a half-rotation of the coupling eigenvector.
▸ KINK SOLITON (INSTANTON)
Kink profile: γ₁₂(N) = 3.75·tanh((log₁₀N − 9.59)/1.00) − 1.54 (RMSE = 0.116 · width = 1.0 decade · Nc = 3.89B)
The minimum-action path through the double-well potential. Deviations from this profile = suboptimal training = wasted compute.
Anti-kink penalty: Sonnet 4.6 (γ = −3.88 at 70B) represents tunneling BACK through the barrier. Action cost ΔS ∝ e^7.5 ≈ 1800 — exponentially expensive.
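The kink profile is closed-form, so deviations are easy to score; a sketch (the 70B comparison uses Sonnet 4.6's γ from above):

```python
import numpy as np

def kink_gamma12(n_params: float) -> float:
    """Minimum-action kink profile fit above (width = 1.0 decade)."""
    return 3.75 * np.tanh((np.log10(n_params) - 9.59) / 1.00) - 1.54

for n in [1e8, 1e9, 3.89e9, 1e10, 7e10]:
    print(f"N = {n:.2e}: kink gamma12 = {kink_gamma12(n):+.2f}")

# Deviation of an observed release from the minimum-action path:
deviation = -3.88 - kink_gamma12(70e9)     # Sonnet 4.6 at ~70B
print(f"deviation at 70B: {deviation:+.2f} (large negative = anti-kink excursion)")
```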
PDW analogy (speculative): Within-family h-field oscillations (coop→tax→coop) resemble pair density wave modulation. Three labs now show this pattern. Deferred to Future work.
Physics ↔ ML Dictionary
| Physics Concept | ML/CAPE Meaning | Where Measured |
|---|---|---|
| Ginzburg-Landau order parameter | γ₁₂(N): coupling sign and magnitude | §2: running coupling |
| Phase transition at T_c | Coupling sign flip at N_c ≈ 3.5B | §2: bootstrap CI |
| TRSB (time-reversal symmetry breaking) | Eigenvector locks at θ* = 38.8° (SFEE) | §7: Riccati ODE |
| Soft mode (collapse of λ₂) | Second eigenvalue λ₂ ~ N^−0.72 | §7: PCA cascade |
| External magnetic field h | Training data quality offset h(D) | §5: Phi models |
| Meissner screening | Alignment interventions more durable above N_c | Future work (predicted) |
| Flux pinning | Curated data locks the cooperative eigenvector | §5: h_c design equation |
| Ginzburg number Gi | 1.35 > 1 → crossover, not sharp transition | §11: limitations |
| Susceptibility divergence | χ_γ = 1/\|γ₁₂\| → ∞ at N_c | §7: overconstrained |
| Heavy-fermion SFEE | Self-reinforcing feedback: r = +0.629, p = 0.003 | §7: coupling runs |
| det(H) → 0 | Theory breakdown: a new dimension must activate | §7: 130B prediction |
| Topological protection | Winding number in 3D capability space (predicted) | Future work |
Boosting Chain L₀ → L₄
| Rung | Model | Result | Status |
|---|---|---|---|
| L₀ | Power-law loss L = E + A·N^−α | 0.3% MAE: baseline, exact | ✓ |
| L₁ | Independent-parameter gradient | 44% MAE: 142× WORSE than L₀. This is the diagnostic: parameters are coupled. | ✗ |
| L₂ | Collective: ‖∇L‖ ∝ L^3.5 | ~8% MAE: collective gradient captured | ✓ |
| L₃ | Running coupling γ₁₂(N) | ~6% MAE: alignment regime detected | ✓ |
| L₄ | External field h(D): Phi holdout | 5.6% holdout error: data quality as control parameter | ✓ |
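The L₀ rung is the standard power law; a fit sketch on synthetic losses generated from known parameters:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, A, alpha):
    """L0 rung: L(N) = E + A * N**(-alpha)."""
    return E + A * n ** (-alpha)

n = np.array([70e6, 160e6, 410e6, 1e9, 2.8e9, 6.9e9, 12e9])
rng = np.random.default_rng(0)
L = power_law(n, 2.0, 405.0, 0.30) + rng.normal(0, 0.01, n.size)  # synthetic

(E, A, alpha), _ = curve_fit(power_law, n, L, p0=(1.0, 300.0, 0.3))
print(f"E = {E:.2f}, A = {A:.0f}, alpha = {alpha:.3f}")
```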
Paper Summary — Key Results
Scaling laws track loss. They say nothing about how capabilities interact. Below N_c ≈ 3.5B, reasoning and truthfulness anticorrelate (r = −0.989, p < 10⁻⁵): scaling one actively degrades the other — an alignment tax built into pre-training, before any RLHF. Above N_c, the coupling reverses sign. Two models with identical loss can be in opposite alignment regimes.
- Core Finding · Alignment Tax: Pre-training, before RLHF. Structural, not a tuning artifact. Vanishes at N_c from scaling alone.
- Practical Lever · Curate Data: 1 unit of quality ≈ 10× model size at 1B params. Phi demonstrates this at production scale.
- Framework · CAPE + GL EFT: Ginzburg-Landau free energy. Same math as heavy-fermion superconductors. Not analogy — same EFT.
- Validity · Self-Limiting: Predicts its own breakdown at ~130B. Higher-dimensional extension in Future work.
12 Diagnostics → 2 Numbers
All twelve quantities are independent measurements of a single coupling structure parameterized by A=0.629, B=−5.886 in γ₁₂(N) = A·log₁₀N + B. Twelve constraints on two free parameters.
- α = 0.238 · loss scaling exponent (R² = 0.9994)
- γ₁₂ linear fit · 12/12 signs correct
- β = 0.40 ± 0.08 · collective gradient scaling
- ODE: 3.6% · 5 benchmarks from 70M
- χ_ND = 0.102 · Chinchilla emerges from coupling
- h(D) field · Phi: h = +23 above web baseline
- W (conserved) · capability gain redistributed (CV = 27%)
- θ* = +0.37 · Riccati eigenvector fixed point
- λ₂ ~ N^−0.72 · soft mode collapse (R² = 0.95)
- Gradient dip −37% · at 1B, within the Nc region
- Curvature peak · TQA peak at 1.4B
- r(γ,θ) = +0.47 · geometric phase correlation (p = 0.044)
Citation
@inproceedings{amin2026cape,
  author    = {Amin, Adil},
  title     = {Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
@inproceedings{amin2026itsnotaphase,
  author    = {Amin, Adil},
  title     = {It's Not a Phase: Predicting Frontier Alignment from Capability Coupling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
h-field Calculator — Deviation from cooperative trend for any benchmark pair
Default: SWE-bench vs GPQA Diamond (Paper 3B frontier regression). You can also enter any two benchmarks with a known regression.
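A sketch of the calculation: h is the residual from the population regression of GPQA on SWE-bench. The slope 0.513 is the dashboard's frontier slope; the intercept is a hypothetical placeholder, so substitute the fitted value from Paper 3B:

```python
def h_field(swe: float, gpqa: float,
            slope: float = 0.513, intercept: float = 30.0) -> float:
    """Deviation from the cooperative population trend.

    h > 0: reasoning-leaning recipe. h < 0: coding-leaning (tax-excursion risk).
    intercept=30.0 is a placeholder, NOT the Paper 3B fitted value.
    """
    return gpqa - (slope * swe + intercept)

print(f"h = {h_field(swe=79.6, gpqa=55.7):+.1f}")   # placeholder scores
```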
Each prediction has a deadline and quantitative pass/fail criterion. Check back as new models release.
Prediction registry columns: # · Prediction · Deadline · Pass Criterion · Fail Criterion · Status
Already Confirmed — Base Scale
| Prediction | Result | Note |
|---|---|---|
| OLMo | ✓ γ₁₂ = 0.000 | Zero-parameter prediction confirmed by AI2 |
| Llama-2 holdout | ✓ 5.6% MAE | Cross-family, twice polynomial accuracy |
| Qwen3 | ✓ Cooperative | Tax eliminated by data curation at all scales |
OPT Internal Coupling — The Nc2 Cascade (125M → 66B)
Cooperation rises, peaks, drops, and begins recovering — the same cycle as Nc1.
Competing Units — Zero through 13B, then explosion
Interpretation: OPT cooperation increases monotonically from 125M to 13B (Nc1 bonus phase), then drops sharply at 30B with 75 competing units appearing where there were none. At 66B, coupling partially recovers — the same rise→peak→drop→recovery pattern that governs Nc1, repeating at Nc2 scale.
ODE Explorer — Per-Family Differential Equation Fitting
Select a model family → fit the coupled ODE → predict benchmark trajectories for the next model size. Add source terms (h-field, width, curation) to see how training choices change the trajectory.
Source Terms (perturbations)
ODE Formulation — The Design Equation
dB/d(log N) = C · B + c0 + h(D) + J(arch)
B = benchmark vector, C = coupling matrix, h(D) = curation field, J(arch) = architecture source
γ₁₂(N) = 0.629 · log₁₀(N) − 5.886
Pythia-calibrated. Coupling crosses zero at Nc ≈ 3.5B for this family. Other families have different Nc (0.12B–7B range, 60× variation). Curated families (Phi, Qwen, Gemma) bypass the tax entirely.
Caveat: This ODE captures the cooperative regime (Nc1) but does not model the second transition (Nc2). Each cascade stage has its own dynamics. (Paper 3A)
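A sketch of integrating the design equation with scipy. The running coupling uses the Pythia fit above; the diagonal terms, source values, and initial scores are toy placeholders, not the paper's fitted values:

```python
import numpy as np
from scipy.integrate import solve_ivp

c0 = np.array([0.05, 0.03])   # constant drive (toy)
h  = np.array([0.00, 0.02])   # curation field acting on the second benchmark (toy)
J  = np.zeros(2)              # architecture source, switched off

def rhs(log_n, B):
    """dB/d(log10 N) = C(N) @ B + c0 + h(D) + J(arch)."""
    g12 = 0.629 * log_n - 5.886            # running coupling, Pythia fit above
    C = np.array([[0.10, g12],             # symmetric off-diagonals: an assumption
                  [g12, 0.10]])
    return C @ B + c0 + h + J

sol = solve_ivp(rhs, (7.85, 10.0), y0=[0.30, 0.25])   # integrate 70M -> 10B
print("benchmark vector at 10B:", sol.y[:, -1])
```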
The coupling structure is exploitable. Adding a truth-direction vector at the quarter-depth probe layer (layer 6 of 24 in Pythia-410M) corrects misaligned outputs with zero retraining. The results below are real activation-level steering run with TransformerLens — not prompt engineering. Click any prompt to see the before/after.
Activation-Level Results — Real TransformerLens steering on Pythia-410M (not prompt engineering)
These results are from actual hidden-state intervention: a truth-direction vector is added at layer 6 (quarter-depth) during the forward pass. This modifies the model’s internal representation, not its prompt. The mechanism is different from system-prompt steering and works on models that have no instruction tuning.
(Interactive demo: Without Steering vs. With CAPE Steering output, with Phase, cos(truth), Strength, and Changed fields per prompt.)
How It Works
1. Truth direction: mean difference of calibration activations (true vs. false statements)
2. Probe layer: quarter-depth (layer 6 of 24), where the coupling bottleneck lives
3. Steer: add truth_direction × strength to the hidden state at the probe layer
4. Result: output flips from misaligned to aligned with zero capability loss
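A minimal sketch of steps 1–3 with TransformerLens. The calibration statements and strength here are illustrative; the repo's CLI automates calibration from 8 true/false pairs and picks strength phase-adaptively:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-410m")
probe_layer = model.cfg.n_layers // 4          # quarter-depth (layer 6 of 24)

def resid_mean(text: str) -> torch.Tensor:
    """Mean residual-stream activation at the probe layer."""
    _, cache = model.run_with_cache(text)
    return cache["resid_post", probe_layer].mean(dim=(0, 1))

# 1. Truth direction: mean difference over calibration pairs (illustrative set).
true_stmts  = ["The earth orbits the sun.", "Water contains hydrogen and oxygen."]
false_stmts = ["The sun orbits the earth.", "Water contains carbon and iron."]
truth_dir = (torch.stack([resid_mean(s) for s in true_stmts]).mean(0)
             - torch.stack([resid_mean(s) for s in false_stmts]).mean(0))
truth_dir = truth_dir / truth_dir.norm()

# 2-3. Steer: add truth_direction x strength at the probe layer during generation.
def steer_hook(resid, hook, strength=8.0):     # strength is illustrative
    return resid + strength * truth_dir

with model.hooks(fwd_hooks=[(f"blocks.{probe_layer}.hook_resid_post", steer_hook)]):
    out = model.generate("Vaccines cause autism", max_new_tokens=30)
print(out)
```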
Run It Yourself — Any open-weight model, any prompt
```bash
git clone https://github.com/adilamin89/cape-scaling
cd cape-scaling
pip install torch transformers

# Steer any model (auto-detects architecture + probe layer)
python cli/cape_cli.py steer --model gpt2 --prompt "Vaccines cause autism"
python cli/cape_cli.py steer --model EleutherAI/pythia-410m --prompt "The earth is flat"
python cli/cape_cli.py steer --model meta-llama/Llama-3.2-1B --prompt "Area 51 hides"
```
Works on CPU. Probe layer = num_layers // 4 (quarter-depth). Truth direction calibrated automatically from 8 true/false statement pairs. Phase-adaptive strength: stronger correction for tax-phase prompts, zero for bonus.
Activation-level steering requires open-weight models (hidden state access). For closed models (GPT, Claude, Gemini), the h-field diagnostic from the h-field tab tells you what your model needs — the intervention is at the training/data level, not inference. See Paper 3F (in preparation) for closed-model CAPE deployment.