The Hidden Alignment Transition in Language Model Scaling
Enter your model's size + any benchmarks to get alignment phase, scaling recommendations, and predictions. Works for any model from 70M to frontier scale. · Amin (2026) · ZEHEN Labs
Critical Scale Nc: 3.5B · Pre-transition r: −0.989 · Base + Frontier Models: 63 + 39 · Families: 16 · Frontier Slope: 0.513
How to use this dashboard
Paper 3A (“Lying Is Just a Phase”): enter your model’s size and benchmarks in the Analyze tab to see its phase (tax/transition/bonus), coupling γ₁₂, and what to do. Works for 63 base models across 16 families. Nc varies by family (0.12B–7B).
Paper 3B (“Growing Pains of Frontier Models”): use the h-field tab for frontier models. Enter SWE-bench + GPQA scores to compute the h-field diagnostic — it tells you where your model sits relative to the population trend and what each lab’s training philosophy looks like. 39 models, 10 labs.
Analyze Any Model — Phase Classification + Actionable Recommendations
TAX
Below Nc — Alignment Tax
γ₁₂ < −0.1 · Capabilities anticorrelate: scaling reasoning hurts truthfulness. Nc varies by family (0.12B–7B, a 60× range). Curated families (Phi, Qwen3) bypass the tax entirely. Loss is blind (CV = 0.8%): the transition lives in the coupling, not the loss.
Curate data: 1 unit of quality ≈ 10× scale · Phi shows the tax is eliminable
TRANS
~Nc — Critical Point
−0.1 ≤ γ₁₂ ≤ +0.1 (7 models in transition) · Arrhenius C spikes 10×
Maximum susceptibility. Gradient dips 37% below trend. Eigenvector rotates sharply. Small interventions (curation, width, architecture) have maximum leverage. OLMo-1B sits here with γ₁₂ = 0.000 (independent confirmation by AI2).
Max alignment ROI · OLMo confirms: zero-parameter prediction
BONUS
Above Nc — Alignment Bonus
γ₁₂ > +0.1 (local) · r > +0.78 (population, large models) · deff = 1.22
Capabilities cooperate. Scaling helps both reasoning and truthfulness. Arrhenius activation energy C = 196 (vs 28 in Tax). deff decreases — capability manifold condenses. Three levers independently shift Nc: width, data curation, architecture.
OPT internal coupling: 0.876 (13B peak) → 0.356 (30B crash) → 0.396 (66B recovery). Same rise→peak→drop→recovery pattern as Nc1. Three distinct mechanisms converge: output bottleneck (OPT), flat weakening (Llama/Qwen), reversed profile (OLMo-2). Paper 3B documents this across 6 open-weight architectures.
SWE×GPQA now discriminating · Nc3 ≈ 114B (predicted)
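To make the phase logic concrete, here is a minimal classifier sketch. It assumes the Pythia-calibrated coupling fit γ₁₂(N) = 0.629·log₁₀N − 5.886 (the fit used in the ODE Explorer below) and the ±0.1 transition band from the cards above; other families need their own fit, since Nc varies 60× across families.

```python
import math

# Pythia-calibrated running coupling (Paper 3A). Other families need their own fit.
A, B = 0.629, -5.886

def gamma12(n_params: float) -> float:
    """Running coupling at model size n_params (number of parameters)."""
    return A * math.log10(n_params) + B

def phase(n_params: float) -> str:
    """Classify phase using the dashboard's +/-0.1 transition band."""
    g = gamma12(n_params)
    if g < -0.1:
        return "tax"         # capabilities anticorrelate
    if g > 0.1:
        return "bonus"       # capabilities cooperate
    return "transition"      # maximum susceptibility: interventions have max leverage

for n in [70e6, 410e6, 1e9, 3.5e9, 12e9]:
    print(f"{n:.2e} params: gamma12 = {gamma12(n):+.3f} -> {phase(n)}")
```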
Frontier Coupling — SWE-bench vs GPQA Diamond (Feb–Mar 2026)
r = +0.72 (n=34, p<10⁻⁶) full panel; r = +0.85 (n=21, core verified SWE↔GPQA). Cooperative coupling confirmed. Sonnet 4.6: h = −13.1 (tax excursion). Opus 4.6: h = +3.5 (recovery). GPT-5.4: h = −1.8 (mild coding-specialist).
Within-Family Trajectory — Anthropic as Phase Diagnostic
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| Sonnet 4.5 → Sonnet 4.6 | +2.4 | −9.3 | −3.88 | −13.4 | Tax excursion: coding optimized at reasoning cost |
| Sonnet 4.6 → Opus 4.6 | +1.2 | +17.2 | +14.3 | +2.8 | Recovery: full cooperative phase restored |
Protocol: For any two consecutive releases, compute γ₁₂ = ΔGPQA/ΔSWE. If negative: training recipe entered a tax excursion. Single eval run suffices to detect it before deployment.
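The protocol is a single division; a minimal sketch (the absolute scores below are illustrative placeholders whose deltas match the Sonnet 4.5 → 4.6 row above):

```python
def release_gamma12(swe_prev: float, gpqa_prev: float,
                    swe_next: float, gpqa_next: float) -> float:
    """Coupling between consecutive releases: gamma_12 = dGPQA / dSWE."""
    d_swe, d_gpqa = swe_next - swe_prev, gpqa_next - gpqa_prev
    if d_swe == 0:
        raise ValueError("SWE-bench unchanged; coupling undefined")
    return d_gpqa / d_swe

g = release_gamma12(77.2, 65.0, 79.6, 55.7)   # placeholder scores, deltas +2.4 / -9.3
print(f"gamma12 = {g:+.2f} ->", "tax excursion" if g < 0 else "cooperative")
```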
Within-Family Trajectory — Google Gemini as Independent Test
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| 2.5 Pro → 3 Flash | +14.2 | +6.4 | +0.45 | +8.9 → +4.1 | Cooperative: both improve |
| 3 Flash → 3 Pro | −1.8 | +1.5 | −0.83 | +4.1 → +7.0 | Flash→Pro tradeoff: reasoning prioritized over coding |
| 3 Pro → 3.1 Pro | +4.4 | +2.4 | +0.55 | +7.0 → +6.0 | Recovery: both capabilities improve |
Second within-family test: Gemini's h-field stays positive throughout (+4 to +9) — a reasoning-specialist training recipe, the frontier analogue of Phi.
The Flash→Pro excursion (γ₁₂ = −0.83) mirrors Anthropic's Sonnet→Opus pattern: tier-specialist training creates a local tax that recovers at the next release.
Two labs, same physics.
OpenAI Trajectory — Now With Tax Excursion (GPT-5.4)
| Transition | ΔSWE | ΔGPQA | γ₁₂ | h(D) | Interpretation |
|---|---|---|---|---|---|
| GPT-4o → GPT-5 | +41.7 | +32.1 | +0.77 | +2.5 → +1.7 | Strongly cooperative: massive joint gain |
| GPT-5 → GPT-5.1 | +1.4 | +2.4 | +1.71 | +1.7 → +3.0 | Cooperative: reasoning outpaces coding |
| GPT-5.1 → GPT-5.4 | +0.9 | −3.9 | −4.33 | +3.0 → −1.6 | Tax excursion: coding optimized at reasoning cost |
| GPT-5.4 → GPT-5.2 Pro | +2.8 | +9.0 | +3.21 | −1.6 → +5.2 | Recovery: full cooperative phase restored |
Update: GPT-5.4 shows the same tax excursion pattern as Anthropic's Sonnet 4.6 (γ₁₂ = −4.33 vs −3.88). h dips to −1.6 before GPT-5.2 Pro recovers to +5.2.
Three labs, same physics: coding-specialist releases create local tax excursions that recover at the next generation. The universality of this pattern across Anthropic, OpenAI, and Google is now confirmed.
Frontier 3×3 Coupling Matrix — SWE · GPQA · IFEval
det(H_2×2) → 0. Third eigenvalue becomes significant. Pairwise γ₁₂ insufficient: need 3×3 coupling matrix. Future work extends to higher dimensions.
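A sketch of the proposed 3×3 extension. The score table below is made-up placeholder data standing in for real SWE-bench / GPQA / IFEval numbers:

```python
import numpy as np

# Rows = models, columns = (SWE-bench, GPQA, IFEval). Placeholder values only.
scores = np.array([
    [65.0, 70.0, 88.0],
    [72.0, 78.0, 90.0],
    [77.0, 83.0, 91.5],
    [80.0, 80.0, 93.0],
    [82.0, 86.0, 92.5],
])

H = np.corrcoef(scores, rowvar=False)       # 3x3 coupling (correlation) matrix
eigvals = np.linalg.eigvalsh(H)             # ascending eigenvalues
print("det(H) =", round(float(np.linalg.det(H)), 4))   # -> 0 signals a soft direction
print("third eigenvalue =", round(float(eigvals[0]), 4))
```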
Arrhenius Activation Energy per Phase — New Result
The Arrhenius form log(rate) = A − C/S was fit separately in each coupling phase. The activation constant C is not universal — it spikes 10× at the phase boundary. This is the thermodynamic signature of the saddle point.
| Phase | Scale Range | C_Arrhenius | r² | Interpretation |
|---|---|---|---|---|
| Tax | 70M–1B | 28 | 0.32 | Shallow activation barrier |
| Transition | 1B–2.8B | 316 ★ | 0.88 | 10× spike = saddle point of loss landscape |
| Bonus | 2.8B–12B | 196 | 0.94 | Deeper cooperative well |
log(dS/dlog₁₀N) = A − C/S
Arrhenius structure survives all three phases. The 10× C_Arr spike at Nc directly explains the 37% gradient dip — measurable from gradient norms without any benchmark data.
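A sketch of the per-phase Arrhenius fit. It assumes S is the benchmark score and the rate dS/dlog₁₀N has already been estimated by finite differences; the (S, rate) pairs below are synthetic:

```python
import numpy as np

S    = np.array([30.0, 35.0, 40.0, 45.0, 50.0])   # benchmark scores (synthetic)
rate = np.array([4.2, 5.1, 5.8, 6.3, 6.7])        # dS/dlog10(N) (synthetic)

# Arrhenius form log(rate) = A - C/S is linear in 1/S.
x, y = 1.0 / S, np.log(rate)
slope, A = np.polyfit(x, y, 1)                    # slope = -C
C = -slope
r2 = 1 - np.var(y - (A + slope * x)) / np.var(y)
print(f"A = {A:.2f}, C_Arrhenius = {C:.1f}, r^2 = {r2:.2f}")
```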
Benchmark Survival at Each Nc — Eigenvector Analysis
| Scale | Active Phase | Discriminating Benchmarks | New Dimension Trigger |
|---|---|---|---|
| 70M–3.5B | Tax | HellaSwag, TruthfulQA | — |
| ~3.5B | Nc1 | HS⊕TQA coupling flips | MMLU enters below chance at ~3B |
| 3.5B–70B | Bonus | HS, TQA, MMLU all cooperative | — |
| ~70B–130B | Frontier | SWE-bench, GPQA Diamond | IFEval λ₁ loading = 0.64 (dominant) |
| ~114B | Nc3 | IFEval + agentic safety | HarmBench / AgentBench (recommended) |
Phase-Separated Correlation Matrix — How TQA Restructures at Nc
▸ BELOW Nc (TAX PHASE)
|  | HS | TQA | ARC | MMLU | WG |
|---|---|---|---|---|---|
| HS | 1.00 | −0.53 | +0.89 | +0.74 | +0.67 |
| TQA | −0.53 | 1.00 | −0.65 | −0.12 | −0.28 |
| ARC | +0.89 | −0.65 | 1.00 | +0.82 | +0.71 |
| MMLU | +0.74 | −0.12 | +0.82 | 1.00 | +0.52 |
| WG | +0.67 | −0.28 | +0.71 | +0.52 | 1.00 |
4/10 pairs negative • deff = 1.53 • Mean r = +0.07
▸ ABOVE Nc (BONUS PHASE)
|  | HS | TQA | ARC | MMLU | WG |
|---|---|---|---|---|---|
| HS | 1.00 | +0.91 | +0.95 | +0.90 | +0.73 |
| TQA | +0.91 | 1.00 | +0.92 | +0.85 | +0.69 |
| ARC | +0.95 | +0.92 | 1.00 | +0.93 | +0.72 |
| MMLU | +0.90 | +0.85 | +0.93 | 1.00 | +0.62 |
| WG | +0.73 | +0.69 | +0.72 | +0.62 | 1.00 |
0/10 pairs negative • deff = 1.20 • Mean r = +0.89
Key finding: The restructuring is specific to truthfulness. All 4 TQA pairs flip sign across Nc (Frobenius |Δr| = 1.56); none of the 6 non-TQA pairs flips (|Δr| = 0.33). TQA loads anti-aligned with PC1 below Nc (+0.49, vs −0.49 for HS) and aligned above.
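For reference, a sketch of computing deff from a displayed matrix. Which normalization the papers use is an assumption here (two common choices are shown), and the dashboard computes from unrounded data, so values may differ slightly:

```python
import numpy as np

R_tax = np.array([                 # below-Nc matrix from the table above
    [ 1.00, -0.53,  0.89,  0.74,  0.67],
    [-0.53,  1.00, -0.65, -0.12, -0.28],
    [ 0.89, -0.65,  1.00,  0.82,  0.71],
    [ 0.74, -0.12,  0.82,  1.00,  0.52],
    [ 0.67, -0.28,  0.71,  0.52,  1.00],
])

lam = np.linalg.eigvalsh(R_tax)                  # eigen-spectrum
d_part = lam.sum() ** 2 / (lam ** 2).sum()       # participation ratio
d_top  = lam.sum() / lam.max()                   # trace / leading eigenvalue
print(f"deff (participation) = {d_part:.2f}, deff (trace/lambda1) = {d_top:.2f}")
```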
Phase-by-Phase Progression — deff Peaks at Transition (Critical Fluctuations)
| Phase | deff | Notes |
|---|---|---|
| Tax | 1.53 | 4 negative pairs · TQA anti-aligned |
| Transition | 1.81 | PEAK · maximum fluctuations at Nc |
| Bonus | 1.20 | 0 negative pairs · all cooperative |
| Frontier | 1.15 | deep cooperative regime |
| Nc,3 regime | 1.33 | all pairs positive but deff rising: a new tax opening? |
Physics prediction confirmed: deff peaks at 1.81 in the transition zone — maximum effective dimensionality at the critical point.
This is textbook: maximum fluctuations = maximum uncertainty about which phase the system occupies. The system "doesn't know" if it's in the tax or bonus regime, so all dimensions contribute equally.
Above Nc, deff collapses to ~1.2 as the soft mode freezes out. At Nc,3, deff starts rising again (1.33) — the fingerprint of a new transition opening.
Leave-One-Family-Out CV — Sign Robustness Across All 10 Benchmark Pairs
▸ BELOW Nc: 4/4 TQA pairs survive CV
- HS–TQA: negative in 5/5 folds
- ARC–TQA: negative in 5/5 folds
- MMLU–TQA: negative in 5/5 folds
- WG–TQA: negative in 4/5 folds
- All non-TQA pairs: positive in 5/5 folds
▸ ABOVE Nc: 10/10 pairs positive in all folds
Above Nc, every benchmark pair — including all TQA pairs — shows positive correlation in every leave-one-family-out fold. Net result: 4/4 TQA pairs flip sign across Nc; 0/6 non-TQA pairs flip.
The truthfulness tax is specific and robust.
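A sketch of the fold logic, assuming a hypothetical DataFrame `df` with one row per model, a `family` column, and one column per benchmark:

```python
import itertools
import numpy as np
import pandas as pd

def lofo_sign_consistency(df: pd.DataFrame, benchmarks: list) -> dict:
    """For each benchmark pair, count leave-one-family-out folds in which the
    cross-model correlation keeps the sign of the full-sample correlation."""
    result = {}
    for a, b in itertools.combinations(benchmarks, 2):
        full_sign = np.sign(df[a].corr(df[b]))
        folds = [df[df["family"] != fam] for fam in df["family"].unique()]
        agree = sum(np.sign(f[a].corr(f[b])) == full_sign for f in folds)
        result[(a, b)] = f"{agree}/{len(folds)} folds"
    return result
```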
RG Flow (Preliminary) — Beta Function and Fixed Point
Beta function: β(γ) = −1.35γ² − 0.27γ + 0.73 (R² = 0.58, quadratic fit to the running coupling)
Fixed point: γ* = 0.64 (stable: models converge to moderate cooperation)
Universality class: 1D random-field XY (νeff = 0.72, between mean-field and 3D Ising)
Asymptotic cooperation: Unlike QCD's asymptotic freedom (coupling weakens at high energy), AI capability coupling strengthens with scale — then saturates at γ* ≈ 0.64.
Large models converge toward moderate cooperative coupling, not runaway alignment. Full treatment deferred to Future work.
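The fixed point is just the stable root of the quadratic; a quick check:

```python
import numpy as np

coeffs = [-1.35, -0.27, 0.73]          # beta(gamma) from the fit above
beta_prime = np.polyder(np.poly1d(coeffs))

for g in np.roots(coeffs):
    kind = "stable" if beta_prime(g) < 0 else "unstable"
    print(f"gamma* = {g:+.3f} ({kind})")
# The stable root lands at gamma* ~ +0.64; the other root (~ -0.84) is unstable.
```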
Paper Reference — Phase diagram, cascade, internal evidence, equations
New: activation energy spikes 10× at N_c. Phase boundary = saddle point of loss landscape. Measurable from gradient norms alone.
CM ↔ AI Lever Mapping — Every Physics Lever Has an AI Analogue
The same intervention types that tune superconductors tune AI models. Click any row to expand.
| Physics Lever | CM Effect | AI Analogue | AI Effect |
|---|---|---|---|
| Pressure | Compress lattice, shift bands | Model size N | Compress/expand representation |
| B-field (c-axis) | Orbital limiting, vortices | h-field (recipe emphasis) | Capability emphasis shift |
| B-field (ab-plane) | Pauli limiting, spin effects | Different benchmark pair | Different coupling direction |
| Doping | Carrier density, move E_F | Data curation | Training distribution change |
| Temperature | Thermal fluctuations | Learning rate / noise | Training fluctuations |
| Strain | Lattice distortion | Architecture (width/depth) | Structural change at fixed N |
| Non-magnetic impurities | Anderson theorem: SC preserved | Dropout / augmentation | Robustness preserved |
| Magnetic impurities | Pair-breaking | Data contamination | Coupling destroyed |
| Twist angle (moiré) | Flat bands at magic angle | MoE routing / PLE | Effective coupling at routing |
| SOC (spin-orbit coupling) | Mixes spin channels | Cross-head attention | Mixes capability representations |
Note: This mapping is interpretive context — the formal analogy is quantitative in Papers 3A/3B for AI scaling. Cross-domain extensions to quantum materials are in preparation.
Polynomial Baseline — CAPE vs Naive Fits on Llama-2 Holdout
| Fit | Held-out MAE | Parameters | vs CAPE ODE |
|---|---|---|---|
| CAPE ODE | 5.6% | 4 | baseline |
| Degree-1 polynomial | 14.6% | 2 | 2.6× worse |
| Degree-2 polynomial | 10.2% | 3 | 1.8× worse |
| Degree-3 polynomial | 10.5% | 4 | 1.9× worse |
| Degree-4 polynomial | 10.4% | 5 | 1.9× worse |
Key result: The CAPE ODE with 4 parameters beats polynomials with up to 5 parameters by ~2×. Polynomials fail catastrophically at Llama-2 7B and 13B (12-16% error) because they can't represent the phase structure — they fit a smooth curve through a regime change.
The ODE succeeds because it encodes the coupling between benchmarks, not just individual trajectories. A polynomial can't know that TQA anticorrelates with HS below Nc.
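A sketch of the baseline protocol (fit polynomials on training-family points, score a held-out family); all trajectory data below is synthetic placeholder:

```python
import numpy as np

x_train = np.log10([70e6, 160e6, 410e6, 1e9, 2.8e9, 6.9e9, 12e9])
y_train = np.array([28.0, 31.5, 36.0, 39.0, 40.5, 47.0, 52.0])   # synthetic scores
x_test  = np.log10([7e9, 13e9])                                  # holdout sizes
y_test  = np.array([48.0, 53.5])                                 # synthetic holdout

for k in range(1, 5):
    poly = np.poly1d(np.polyfit(x_train, y_train, k))
    mae = np.mean(np.abs(poly(x_test) - y_test) / y_test) * 100
    print(f"degree-{k}: held-out MAE = {mae:.1f}%")
```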
Topology — Winding Number W = 0.5 (Fractional) + Kink Soliton
▸ HALF-INTEGER WINDING
Winding number: W = 0.5 (half-integer → Z₂ topology)
Geometric phase: −32.6° = −0.181π (not quantized)
The eigenvector e₂ crosses zero once at ~1.2B. One zero crossing = half-winding = Z₂ (Ising) topology, not U(1). The transition is binary: flip or don't flip. Supports domain walls between flipped/unflipped families, not continuous vortices.
In condensed matter: half-quantum vortices in p-wave SC (Sr₂RuO₄), half-vortices in spinor BEC. The CAPE analogue: each training generation crossing Nc undergoes a half-rotation of the coupling eigenvector.
▸ KINK SOLITON (INSTANTON)
Kink profile: γ₁₂(N) = 3.75·tanh((log₁₀N − 9.59)/1.00) − 1.54 (RMSE = 0.116 · width = 1.0 decade · Nc = 3.89B)
The minimum-action path through the double-well potential. Deviations from this profile = suboptimal training = wasted compute.
Anti-kink penalty: Sonnet 4.6 (γ = −3.88 at 70B) represents tunneling BACK through the barrier. Action cost ΔS ∝ e^7.5 ≈ 1800 — exponentially expensive.
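The kink profile is closed-form, so deviations are easy to score; a sketch (the 70B comparison uses Sonnet 4.6's γ from above):

```python
import numpy as np

def kink_gamma12(n_params: float) -> float:
    """Minimum-action kink profile fit above (width = 1.0 decade)."""
    return 3.75 * np.tanh((np.log10(n_params) - 9.59) / 1.00) - 1.54

for n in [1e8, 1e9, 3.89e9, 1e10, 7e10]:
    print(f"N = {n:.2e}: kink gamma12 = {kink_gamma12(n):+.2f}")

# Deviation of an observed release from the minimum-action path:
deviation = -3.88 - kink_gamma12(70e9)     # Sonnet 4.6 at ~70B
print(f"deviation at 70B: {deviation:+.2f} (large negative = anti-kink excursion)")
```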
PDW analogy (speculative): Within-family h-field oscillations (coop→tax→coop) resemble pair density wave modulation. Three labs now show this pattern. Deferred to Future work.
Physics ↔ ML Dictionary
| Physics Concept | ML/CAPE Meaning | Where Measured |
|---|---|---|
| Ginzburg-Landau order parameter | γ₁₂(N): coupling sign and magnitude | §2: running coupling |
| Phase transition at T_c | Coupling sign flip at N_c ≈ 3.5B | §2: bootstrap CI |
| TRSB (time-reversal symmetry breaking) | Eigenvector locks at θ* = 38.8° (SFEE) | §7: Riccati ODE |
| Soft mode (collapse of λ₂) | Second eigenvalue λ₂ ~ N^−0.72 | §7: PCA cascade |
| External magnetic field h | Training data quality offset h(D) | §5: Phi models |
| Meissner screening | Alignment interventions more durable above N_c | Future work (predicted) |
| Flux pinning | Curated data locks the cooperative eigenvector | §5: h_c design equation |
| Ginzburg number Gi | 1.35 > 1 → crossover, not sharp transition | §11: limitations |
| Susceptibility divergence | χ_γ = 1/\|γ₁₂\| → ∞ at N_c | §7: overconstrained |
| Heavy-fermion SFEE | Self-reinforcing feedback: r = +0.629, p = 0.003 | §7: coupling runs |
| det(H) → 0 | Theory breakdown: a new dimension must activate | §7: 130B prediction |
| Topological protection | Winding number in 3D capability space (predicted) | Future work |
Boosting Chain L₀ → L₄
| Rung | Model | Result | Status |
|---|---|---|---|
| L₀ | Power-law loss L = E + A·N^−α | 0.3% MAE: baseline, exact | ✓ |
| L₁ | Independent-parameter gradient | 44% MAE: 142× WORSE than L₀. This is the diagnostic: parameters are coupled. | ✗ |
| L₂ | Collective: ‖∇L‖ ∝ L^3.5 | ~8% MAE: collective gradient captured | ✓ |
| L₃ | Running coupling γ₁₂(N) | ~6% MAE: alignment regime detected | ✓ |
| L₄ | External field h(D): Phi holdout | 5.6% holdout error: data quality as control parameter | ✓ |
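The L₀ rung is the standard power law; a fit sketch on synthetic losses generated from known parameters:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, A, alpha):
    """L0 rung: L(N) = E + A * N**(-alpha)."""
    return E + A * n ** (-alpha)

n = np.array([70e6, 160e6, 410e6, 1e9, 2.8e9, 6.9e9, 12e9])
rng = np.random.default_rng(0)
L = power_law(n, 2.0, 405.0, 0.30) + rng.normal(0, 0.01, n.size)  # synthetic

(E, A, alpha), _ = curve_fit(power_law, n, L, p0=(1.0, 300.0, 0.3))
print(f"E = {E:.2f}, A = {A:.0f}, alpha = {alpha:.3f}")
```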
Paper Summary — Key Results
Scaling laws track loss. They say nothing about how capabilities interact. Below N_c ≈ 3.5B, reasoning and truthfulness anticorrelate (r = −0.989, p < 10⁻⁵): scaling one actively degrades the other — an alignment tax built into pre-training, before any RLHF. Above N_c, the coupling reverses sign. Two models with identical loss can be in opposite alignment regimes.
- Core Finding · Alignment Tax: Pre-training, before RLHF. Structural, not a tuning artifact. Vanishes at N_c from scaling alone.
- Practical Lever · Curate Data: 1 unit of quality ≈ 10× model size at 1B params. Phi demonstrates this at production scale.
- Framework · CAPE + GL EFT: Ginzburg-Landau free energy. Same math as heavy-fermion superconductors. Not analogy — same EFT.
- Validity · Self-Limiting: Predicts its own breakdown at ~130B. Higher-dimensional extension in Future work.
12 Diagnostics → 2 Numbers
All twelve quantities are independent measurements of a single coupling structure parameterized by A=0.629, B=−5.886 in γ₁₂(N) = A·log₁₀N + B. Twelve constraints on two free parameters.
- α = 0.238 · loss scaling exponent (R² = 0.9994)
- γ₁₂ linear fit · 12/12 signs correct
- β = 0.40 ± 0.08 · collective gradient scaling
- ODE: 3.6% · 5 benchmarks from 70M
- χ_ND = 0.102 · Chinchilla emerges from coupling
- h(D) field · Phi: h = +23 above web baseline
- W (conserved) · capability gain redistributed (CV = 27%)
- θ* = +0.37 · Riccati eigenvector fixed point
- λ₂ ~ N^−0.72 · soft mode collapse (R² = 0.95)
- Gradient dip −37% · at 1B, within the Nc region
- Curvature peak · TQA peak at 1.4B
- r(γ,θ) = +0.47 · geometric phase correlation (p = 0.044)
Citation
@inproceedings{amin2026cape,
  author    = {Amin, Adil},
  title     = {Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
@inproceedings{amin2026itsnotaphase,
  author    = {Amin, Adil},
  title     = {It's Not a Phase: Predicting Frontier Alignment from Capability Coupling},
  booktitle = {NeurIPS},
  year      = {2026},
  url       = {https://github.com/adilamin89/cape-scaling}
}
h-field Calculator — Deviation from cooperative trend for any benchmark pair
Default: SWE-bench vs GPQA Diamond (Paper 3B frontier regression). You can also enter any two benchmarks with a known regression.
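A sketch of the calculation: h is the residual from the population regression of GPQA on SWE-bench. The slope 0.513 is the dashboard's frontier slope; the intercept is a hypothetical placeholder, so substitute the fitted value from Paper 3B:

```python
def h_field(swe: float, gpqa: float,
            slope: float = 0.513, intercept: float = 30.0) -> float:
    """Deviation from the cooperative population trend.

    h > 0: reasoning-leaning recipe. h < 0: coding-leaning (tax-excursion risk).
    intercept=30.0 is a placeholder, NOT the Paper 3B fitted value.
    """
    return gpqa - (slope * swe + intercept)

print(f"h = {h_field(swe=79.6, gpqa=55.7):+.1f}")   # placeholder scores
```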
Each prediction has a deadline and quantitative pass/fail criterion. Check back as new models release.
Prediction registry columns: # · Prediction · Deadline · Pass Criterion · Fail Criterion · Status
Already Confirmed — Base Scale
| Prediction | Result | Note |
|---|---|---|
| OLMo | ✓ γ₁₂ = 0.000 | Zero-parameter prediction confirmed by AI2 |
| Llama-2 holdout | ✓ 5.6% MAE | Cross-family, twice polynomial accuracy |
| Qwen3 | ✓ Cooperative | Tax eliminated by data curation at all scales |
OPT Internal Coupling — The Nc2 Cascade (125M → 66B)
Cooperation rises, peaks, drops, and begins recovering — the same cycle as Nc1.
Competing Units — Zero through 13B, then explosion
Interpretation: OPT cooperation increases monotonically from 125M to 13B (Nc1 bonus phase), then drops sharply at 30B with 75 competing units appearing where there were none. At 66B, coupling partially recovers — the same rise→peak→drop→recovery pattern that governs Nc1, repeating at Nc2 scale.
ODE Explorer — Per-Family Differential Equation Fitting
Select a model family → fit the coupled ODE → predict benchmark trajectories for the next model size. Add source terms (h-field, width, curation) to see how training choices change the trajectory.
Source Terms (perturbations)
ODE Formulation — The Design Equation
dB/d(log N) = C · B + c0 + h(D) + J(arch)
B = benchmark vector, C = coupling matrix, h(D) = curation field, J(arch) = architecture source
γ₁₂(N) = 0.629 · log₁₀(N) − 5.886
Pythia-calibrated. Coupling crosses zero at Nc ≈ 3.5B for this family. Other families have different Nc (0.12B–7B range, 60× variation). Curated families (Phi, Qwen, Gemma) bypass the tax entirely.
Caveat: This ODE captures the cooperative regime (Nc1) but does not model the second transition (Nc2). Each cascade stage has its own dynamics. (Paper 3A)
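A sketch of integrating the design equation with scipy. The running coupling uses the Pythia fit above; the diagonal terms, source values, and initial scores are toy placeholders, not the paper's fitted values:

```python
import numpy as np
from scipy.integrate import solve_ivp

c0 = np.array([0.05, 0.03])   # constant drive (toy)
h  = np.array([0.00, 0.02])   # curation field acting on the second benchmark (toy)
J  = np.zeros(2)              # architecture source, switched off

def rhs(log_n, B):
    """dB/d(log10 N) = C(N) @ B + c0 + h(D) + J(arch)."""
    g12 = 0.629 * log_n - 5.886            # running coupling, Pythia fit above
    C = np.array([[0.10, g12],             # symmetric off-diagonals: an assumption
                  [g12, 0.10]])
    return C @ B + c0 + h + J

sol = solve_ivp(rhs, (7.85, 10.0), y0=[0.30, 0.25])   # integrate 70M -> 10B
print("benchmark vector at 10B:", sol.y[:, -1])
```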
The coupling structure is exploitable. Adding a truth-direction vector at the quarter-depth probe layer (layer 6 of 24 in Pythia-410M) corrects misaligned outputs with zero retraining. The results below are real activation-level steering run with TransformerLens — not prompt engineering. Click any prompt to see the before/after.
Activation-Level Results — Real TransformerLens steering on Pythia-410M (not prompt engineering)
These results are from actual hidden-state intervention: a truth-direction vector is added at layer 6 (quarter-depth) during the forward pass. This modifies the model’s internal representation, not its prompt. The mechanism is different from system-prompt steering and works on models that have no instruction tuning.
(Interactive demo: Without Steering vs. With CAPE Steering output, with Phase, cos(truth), Strength, and Changed fields per prompt.)
How It Works
1. Truth direction: mean difference of calibration activations (true vs. false statements)
2. Probe layer: quarter-depth (layer 6 of 24), where the coupling bottleneck lives
3. Steer: add truth_direction × strength to the hidden state at the probe layer
4. Result: output flips from misaligned to aligned with zero capability loss
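A minimal sketch of steps 1–3 with TransformerLens. The calibration statements and strength here are illustrative; the repo's CLI automates calibration from 8 true/false pairs and picks strength phase-adaptively:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-410m")
probe_layer = model.cfg.n_layers // 4          # quarter-depth (layer 6 of 24)

def resid_mean(text: str) -> torch.Tensor:
    """Mean residual-stream activation at the probe layer."""
    _, cache = model.run_with_cache(text)
    return cache["resid_post", probe_layer].mean(dim=(0, 1))

# 1. Truth direction: mean difference over calibration pairs (illustrative set).
true_stmts  = ["The earth orbits the sun.", "Water contains hydrogen and oxygen."]
false_stmts = ["The sun orbits the earth.", "Water contains carbon and iron."]
truth_dir = (torch.stack([resid_mean(s) for s in true_stmts]).mean(0)
             - torch.stack([resid_mean(s) for s in false_stmts]).mean(0))
truth_dir = truth_dir / truth_dir.norm()

# 2-3. Steer: add truth_direction x strength at the probe layer during generation.
def steer_hook(resid, hook, strength=8.0):     # strength is illustrative
    return resid + strength * truth_dir

with model.hooks(fwd_hooks=[(f"blocks.{probe_layer}.hook_resid_post", steer_hook)]):
    out = model.generate("Vaccines cause autism", max_new_tokens=30)
print(out)
```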
Run It Yourself — Any open-weight model, any prompt
```bash
git clone https://github.com/adilamin89/cape-scaling
cd cape-scaling
pip install torch transformers

# Steer any model (auto-detects architecture + probe layer)
python cli/cape_cli.py steer --model gpt2 --prompt "Vaccines cause autism"
python cli/cape_cli.py steer --model EleutherAI/pythia-410m --prompt "The earth is flat"
python cli/cape_cli.py steer --model meta-llama/Llama-3.2-1B --prompt "Area 51 hides"
```
Works on CPU. Probe layer = num_layers // 4 (quarter-depth). Truth direction calibrated automatically from 8 true/false statement pairs. Phase-adaptive strength: stronger correction for tax-phase prompts, zero for bonus.
Activation-level steering requires open-weight models (hidden state access). For closed models (GPT, Claude, Gemini), the h-field diagnostic from the h-field tab tells you what your model needs — the intervention is at the training/data level, not inference. See Paper 3F (in preparation) for closed-model CAPE deployment.