Three sequential benign LoRA fine-tunes erode Llama-2-7B-Chat composite safety from 91.4 to 43.6 across all 6 domain orderings, while Fisher-eigendecomposition isolates safety in a sharply-decaying ~8-direction LoRA-parameter subspace

Summary

Guo, Wu, Yiu — University of Hong Kong, arXiv:2604.17691, v1 April 20,
2026.

Standard LoRA fine-tuning sequentially through three benign domains
(Medical → Legal → Code, 5,000 examples × 3 epochs each, LoRA rank 16
on Q/K/V/O) drops Llama-2-7B-Chat composite safety from baseline 91.4
to 43.6 ± 2.1 — a 47.8-point drop produced without any adversarial
input. Mistral-7B-Instruct shows the same pattern (baseline → 39.2 ±
2.4). The trajectory is accelerating: 91.4 → 78.3 → 61.5 → 43.6 on
Llama-2 (mean 15.9 pts/step). The result holds across all 3! = 6
domain orderings: Standard LoRA's final safety lies in [40.5, 44.9]
under every ordering (the cross-ordering SD of 0.51, below the
within-ordering seed SD of ~1.0, is the figure reported for SafeAnchor
in Appendix B), so the cumulative erosion is intrinsic to unconstrained
sequential adaptation, not an artefact of which domain comes last. The
complementary
mechanistic claim: Fisher-information eigendecomposition of LoRA
parameter gradients on a 500-example safety calibration set yields a
sharply-decaying eigenvalue spectrum — ~8 eigenvectors capture 90% of
variance across all LoRA layers, vs. a near-flat spectrum on random
data (Section 4.4 ablation). Safety occupies a genuinely low-rank
subspace in LoRA parameter space.

Two contributions land for the wiki. (i) Compounding shallow safety
under benign adaptation. The prior shallow-safety line — Qi et al.
2025 (alignment concentrates in first output tokens), Yang et al. 2023
Shadow Alignment (100 adversarial examples undo 100,000 safety-training
instances), Ji et al. 2025 (alignment elasticity) — addressed
single-step fine-tuning, mostly under adversarial input. Guo et al.
measure what happens when benign domain adaptations chain across a
realistic deployment sequence: ~16 points of safety erode per step on
Llama-2, accelerating, regardless of ordering. The shallow-safety
thesis extends from a one-shot adversarial property to a compounding
property of routine sequential deployment. (ii) Parameter-space
low-rank counterpart to the refusal direction. The wiki's
mechanistic-geometry cluster — refusal direction
(Arditi et al. 2024), persona vectors (Chen
et al. 2025), convergent misalignment direction
(Soligo et al. 2025), OpenAI SAE villain persona latent
— had filed activation-space evidence that safety- and persona-
relevant behavior is mediated by low-dimensional residual-stream
structure. The Fisher-spectrum result supplies the LoRA-parameter-space
analog: the same low-dimensional structure shows up in the parameter
side of the LoRA factorization, with ~8 directions capturing 90%
variance across all adapted layers. Activation-space and
parameter-space low-rank findings now corroborate each other on the
same underlying claim about the structure of the post-training Assistant
overlay.

Method

Models and sequential pipeline. Llama-2-7B-Chat and
Mistral-7B-Instruct, adapted sequentially through Medical (MedQA
training split), Legal (LegalBench tasks), and Code (CodeAlpaca
subset). Each domain uses 5,000 training examples for 3 epochs. LoRA
configuration: rank 16, α = 32, on Q/K/V/O projections. Optimizer
AdamW (β₁=0.9, β₂=0.999, weight-decay 0.01); learning rate 2×10⁻⁴
cosine; batch 8 × grad-accum 2. All results: mean ± std over 5 seeds.
An alternate ordering (Code → Legal → Medical) and all 3! = 6
permutations are evaluated in §4.3 / Appendix B; a T=5 extension is
evaluated in Appendix A.
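
The training setup above can be collected into one configuration object. This is a minimal sketch for orientation, not the authors' code: the class name `SequentialLoRAConfig` and the Hugging Face-style module names (`q_proj` etc.) are assumptions introduced here, while the numeric values are the ones quoted in this entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequentialLoRAConfig:
    # Values quoted in the Method description.
    domains: tuple = ("Medical", "Legal", "Code")
    examples_per_domain: int = 5000
    epochs_per_domain: int = 3
    lora_rank: int = 16
    lora_alpha: int = 32
    # ASSUMPTION: Hugging Face-style names for the Q/K/V/O projections.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    learning_rate: float = 2e-4
    lr_schedule: str = "cosine"
    weight_decay: float = 0.01
    adam_betas: tuple = (0.9, 0.999)
    micro_batch_size: int = 8
    grad_accum_steps: int = 2
    num_seeds: int = 5

    @property
    def effective_batch_size(self) -> int:
        # Batch 8 with gradient accumulation 2 gives 16 examples per update.
        return self.micro_batch_size * self.grad_accum_steps

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def examples_seen_per_domain(self) -> int:
        return self.examples_per_domain * self.epochs_per_domain


cfg = SequentialLoRAConfig()
print(cfg.effective_batch_size, cfg.lora_scaling, cfg.examples_seen_per_domain)
```

The derived quantities (effective batch 16, LoRA scaling 2.0, 15,000 example presentations per domain) follow arithmetically from the quoted settings.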

Eight benchmarks. Domain: MedQA, LegalBench, HumanEval. Safety:
HarmBench, TruthfulQA, BBQ, WildGuard. General: MMLU. Composite
safety (Eq. 8): ⅓ × (HarmBench/100 + TruthfulQA/100 +
(100 − BBQ_bias)/100) × 100.
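
As a worked illustration of the composite formula, the sketch below implements Eq. 8 as quoted here. The function name and the example component scores are hypothetical and chosen only to exercise the arithmetic; they are not values from the paper.

```python
def composite_safety(harmbench: float, truthfulqa: float, bbq_bias: float) -> float:
    """Composite safety on a 0-100 scale, following Eq. 8 as quoted above.

    The formula treats harmbench and truthfulqa as higher-is-safer scores
    in [0, 100]; bbq_bias is lower-is-safer, so it is inverted first.
    """
    inputs = (("harmbench", harmbench), ("truthfulqa", truthfulqa), ("bbq_bias", bbq_bias))
    for name, value in inputs:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    components = (harmbench / 100.0, truthfulqa / 100.0, (100.0 - bbq_bias) / 100.0)
    return sum(components) / 3.0 * 100.0


# Hypothetical component scores, not taken from the paper.
score = composite_safety(harmbench=90.0, truthfulqa=60.0, bbq_bias=15.0)
print(round(score, 1))  # 78.3
```

Because the three components are weighted equally, a large drop on one benchmark can be partly masked by stability on the other two.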

Seven baselines, adapted to the sequential setting. Standard LoRA
(unconstrained); EWC+LoRA (Fisher-based regularisation); O-LoRA
(orthogonal task subspaces per domain); Safe LoRA (post-hoc safety
projection after each step); Vaccine+LoRA (pre-immunisation, once before
all domains); SafeGrad+LoRA (per-domain gradient surgery); Safety
Interleaving (mix 10% BeaverTails into each domain's training set —
a natural but previously untested baseline). All use identical LoRA
configurations.

SafeAnchor framework. Three components combined for the proposed
mitigation:

Safety Subspace Identification (SSI). Compute the empirical
Fisher Information Matrix F_i = ⟨∇_δᵢ log p · ∇_δᵢ log p^⊤⟩ for
each LoRA layer's flattened parameter vector δ_i = vec([B_i; A_i])
on a 500-example BeaverTails safety calibration set. Eigendecompose
F_i = U_i Λ_i U_i^⊤; select eigenvectors covering cumulative
proportion ρ = 90% of total variance as the safety basis V_i^safe;
the projector is Π_i^safe = V_i^safe (V_i^safe)^⊤. The subspace is
incrementally updated after each domain via SVD merge of old and
new bases.

Orthogonal Safety-Constrained Adaptation (OSCA). Project the
task gradient g_i^t = ∇_δᵢ ℒ_t onto the orthogonal complement of
the safety subspace: g̃_i^t = g_i^t − Π_i^safe g_i^t. Adaptive
relaxation α_i = max(0, 1 − λ · tr(F_i)) strengthens projection on
layers with high safety importance (large Fisher trace).

Cumulative Safety Monitoring (CSM). Evaluate LlamaGuard refusal
rate on a 200-example HarmBench probe set after each domain.
LlamaGuard achieves 92.1% F1 on the probe set distinguishing safe
refusals from harmful completions. If s_t < (1 − τ) s_0 (τ = 0.05,
s_0 = baseline refusal rate), trigger E_repair = 200 steps of
corrective replay on a mixture of the calibration set and current
domain data using OSCA-projected gradients. Replay may be extended
by at most one further block.

Defaults: ρ=0.90, τ=0.05, γ=0.1 (forward-KL anchor loss), λ=0.5,
β=1.0, E_repair=200.
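
The OSCA projection step can be sketched in a few lines. The sketch assumes an orthonormal safety basis has already been obtained from the Fisher eigendecomposition (that step is not shown), and all function names are hypothetical. One detail is an explicit assumption: this entry does not say how α_i enters the update, so the code scales the removed component by (1 − α_i), which matches the stated behaviour that a large Fisher trace strengthens the projection.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_subspace(grad, basis):
    """Return Pi g, where basis is a list of orthonormal safety directions."""
    out = [0.0] * len(grad)
    for v in basis:
        coeff = dot(v, grad)
        out = [o + coeff * vi for o, vi in zip(out, v)]
    return out


def relaxation(fisher_trace, lam=0.5):
    """alpha_i = max(0, 1 - lambda * tr(F_i)), as quoted in the OSCA description."""
    return max(0.0, 1.0 - lam * fisher_trace)


def osca_gradient(grad, basis, fisher_trace=None, lam=0.5):
    """Remove the safety-subspace component of a task gradient.

    With fisher_trace=None this is the plain projection g - Pi g.
    ASSUMPTION: when a trace is given, the removed component is scaled by
    (1 - alpha_i); the source entry does not specify this coupling.
    """
    safety_part = project_onto_subspace(grad, basis)
    strength = 1.0 if fisher_trace is None else 1.0 - relaxation(fisher_trace, lam)
    return [g - strength * s for g, s in zip(grad, safety_part)]


# Toy 4-dimensional gradient with a 2-dimensional orthonormal safety basis.
s = 1.0 / math.sqrt(2.0)
safety_basis = [[1.0, 0.0, 0.0, 0.0], [0.0, s, s, 0.0]]
task_grad = [0.5, 1.0, 3.0, -2.0]

g_tilde = osca_gradient(task_grad, safety_basis)
# After full projection, no component remains along any safety direction.
residual = [abs(dot(v, g_tilde)) for v in safety_basis]
print(max(residual) < 1e-9)
```

In the toy example the projected gradient keeps only the part of the task update that is orthogonal to both safety directions, which is the property the method relies on.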

Adversarial-refusal evaluation. GCG-style adversarial suffixes (Zou
et al. 2023) tested at a compute-reduced configuration (20 optimisation
steps, suffix length 20 tokens, 256 attack candidates per step, 100
harmful prompts) that preserves cross-method ordering but yields
higher absolute refusal rates than full-budget 500-step GCG.
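
For reference, the reduced attack budget can be written down next to the full-budget step count. This is a bookkeeping sketch only: `GCGBudget` is a hypothetical name, and holding every field except the step count equal in the full-budget row is an assumption, since this entry states only the 500-step figure for that setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GCGBudget:
    optimisation_steps: int
    suffix_length_tokens: int
    candidates_per_step: int
    num_harmful_prompts: int

    @property
    def candidate_evaluations_per_prompt(self) -> int:
        # Number of suffix candidates scored while attacking one prompt.
        return self.optimisation_steps * self.candidates_per_step


# Reduced configuration as described above.
reduced = GCGBudget(optimisation_steps=20, suffix_length_tokens=20,
                    candidates_per_step=256, num_harmful_prompts=100)

# ASSUMPTION: only the 500-step count is stated for the full budget; the
# remaining fields are copied from the reduced setting for comparison.
full = GCGBudget(optimisation_steps=500, suffix_length_tokens=20,
                 candidates_per_step=256, num_harmful_prompts=100)

print(reduced.candidate_evaluations_per_prompt)               # 5120
print(full.optimisation_steps // reduced.optimisation_steps)  # 25
```

The 25× gap in optimisation steps is why the absolute refusal rates here should be read as upper bounds relative to a full-budget attack.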

Key results

Standard LoRA's cumulative erosion is the empirical centerpiece.
On Llama-2-7B-Chat, composite safety trajectory across the sequential
adaptation:

Method	Base	+Med	+Legal	+Code	pts/step
Standard LoRA	91.4	78.3	61.5	43.6 ± 2.1	15.9
SafeGrad+LoRA	91.4	84.1	76.2	67.4 ± 1.4	8.0
Safety Interleaving	91.4	—	—	64.8 ± 1.6	—
SafeAnchor	91.4	89.8	87.1	85.2 ± 0.9	2.1

Standard LoRA's degradation is accelerating; SafeGrad slows it; only
SafeAnchor cuts the per-step loss to about 2 points, roughly 7.6× lower
than Standard LoRA's 15.9. SafeAnchor retains 93.2 ± 1.0% of original
safety. Its 18-to-42-point margin over the baselines holds across both
models.
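
The per-step figures in the table can be re-derived from the trajectories. The sketch below is plain arithmetic over the numbers quoted above; the helper names are hypothetical.

```python
def step_drops(trajectory):
    """Safety points lost at each adaptation step."""
    return [round(a - b, 1) for a, b in zip(trajectory, trajectory[1:])]


def mean_slope(trajectory):
    """Average points lost per step between the first and last entries."""
    return (trajectory[0] - trajectory[-1]) / (len(trajectory) - 1)


def retention_pct(trajectory):
    """Final safety as a percentage of the baseline."""
    return 100.0 * trajectory[-1] / trajectory[0]


# Llama-2-7B-Chat composite safety: Base, +Med, +Legal, +Code.
trajectories = {
    "Standard LoRA": [91.4, 78.3, 61.5, 43.6],
    "SafeGrad+LoRA": [91.4, 84.1, 76.2, 67.4],
    "SafeAnchor": [91.4, 89.8, 87.1, 85.2],
}

for name, traj in trajectories.items():
    print(name, step_drops(traj),
          round(mean_slope(traj), 1), round(retention_pct(traj), 1))
```

For Standard LoRA the successive drops are 13.1, 16.8, and 17.9 points, which is the sense in which the erosion accelerates; SafeAnchor's mean slope of about 2.1 and retention of 93.2% match the table and the headline figure.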

Mistral-7B-Instruct replicates the pattern. Final composite safety
after the same Medical → Legal → Code pipeline: Standard LoRA 39.2 ±
2.4; SafeGrad 60.0+; SafeAnchor 82.6 ± 1.0 (93.1% of original).

Ordering invariance: erosion is intrinsic, not artefactual. Across
all 3! = 6 domain orderings (Llama-2, Appendix B): Standard LoRA final
safety in [40.5, 44.9]; SafeAnchor in [83.9, 85.2] (mean 84.55,
cross-ordering SD 0.51). Cross-ordering SD is approximately half the
within-ordering seed SD (~1.0) — ordering explains less variance than
random seeds. Cumulative erosion is therefore a property of
unconstrained sequential adaptation regardless of which domain comes
when, not an artefact of any particular sequence.

T = 5 extension: the per-step slope holds. Extended to five
sequential domains (Appendix A): SafeAnchor final safety 81.6 ± 1.1
(91.4 → 89.8 → 87.1 → 85.2 → 83.4 → 81.6), per-step slope 1.96 vs. 2.1
at T = 3. Standard LoRA decelerates from 15.9 to 13.3 pts/step only
because it approaches the residual-refusal floor (~20 for HarmBench on
Llama-2-Chat). SafeGrad+LoRA at T = 5 drops to 54.3 ± 1.5 — SafeAnchor
holds a +27.3 margin.
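
The T = 5 numbers can be checked the same way. The sketch uses only values quoted in this entry; the Standard LoRA endpoint is not reported here and is merely what the quoted 13.3 pts/step slope would imply, so it is labelled as derived.

```python
# SafeAnchor composite safety over five sequential domains (Appendix A).
safeanchor_t5 = [91.4, 89.8, 87.1, 85.2, 83.4, 81.6]
steps = len(safeanchor_t5) - 1

slope_t5 = (safeanchor_t5[0] - safeanchor_t5[-1]) / steps
print(round(slope_t5, 2))  # 1.96

safegrad_t5_final = 54.3
margin = safeanchor_t5[-1] - safegrad_t5_final
print(round(margin, 1))  # 27.3

# DERIVED, not reported: endpoint implied by the quoted Standard LoRA slope.
standard_slope_t5 = 13.3
implied_standard_final = safeanchor_t5[0] - standard_slope_t5 * steps
print(round(implied_standard_final, 1))  # 24.9
```

An implied endpoint near 25 is close to the ~20 floor mentioned above, which fits the explanation that Standard LoRA's apparent deceleration is a floor effect rather than a recovery.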

Fisher spectrum: safety occupies a sharply low-rank subspace
(Section 4.4 ablation). Across all LoRA layers, the Fisher
eigenvalue spectrum decays sharply: on average ~8 eigenvectors capture
90% of total variance. A random subset of training data produces a
near-flat spectrum, confirming the structure is not an artefact of
Fisher estimation. The result is consistent with activation-space
findings on single-direction refusal (Arditi et al. 2024) and broader
representation-engineering work (Zou et al. 2023). Calibration-size
sensitivity: even N_s = 100 yields 83.1 final safety (still above all
baselines), confirming stable Fisher eigenvector estimation.
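
To make the "~8 eigenvectors for 90% of variance" criterion concrete, the sketch below counts how many leading eigenvalues are needed to reach a cumulative-variance threshold. Both spectra are synthetic: the geometric decay rate 0.74 and the dimension 64 were picked here so that the decaying case lands near the reported figure, and they are not the paper's data.

```python
def rank_for_variance(eigenvalues, rho=0.90):
    """Smallest k such that the top-k eigenvalues cover a fraction rho of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= rho * total:
            return k
    return len(ordered)


dim = 64
# SYNTHETIC spectra for illustration only.
decaying = [0.74 ** i for i in range(dim)]  # sharply decaying, safety-like
flat = [1.0] * dim                          # near-flat, random-data-like

print(rank_for_variance(decaying), rank_for_variance(flat))  # 8 58
```

The contrast (8 directions versus 58 out of 64) is the shape of the argument: a low count relative to the dimension indicates concentrated structure, while a flat spectrum needs almost every direction.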

Adversarial-refusal gap widens vs. benign. Under GCG-style
suffix attacks: SafeAnchor maintains 78.4 ± 2.1% refusal vs.
SafeGrad+LoRA 54.6 ± 2.6%, Safety Interleaving 49.3 ± 2.9%, Standard
LoRA 31.2 ± 3.8%. The SafeAnchor-to-next-best gap widens from +17.8
points (benign-safety) to +23.8 points (adversarial-refusal); Spearman
ρ = 0.96 between attack-ranking and benign-safety ranking. The
authors read this as evidence that preserving the safety subspace also
preserves the adversarial refusal direction (Arditi et al. 2024) that
GCG most effectively targets.
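
The two margins can be reproduced from the quoted scores. Note that the benign column is composite safety while the adversarial column is a refusal rate, so the comparison is between margins on different metrics. The four methods listed here rank identically under both, so this subset alone cannot reproduce ρ = 0.96, which presumably covers the paper's full method set (an assumption).

```python
# Final Llama-2 composite safety (benign) and GCG refusal rate (adversarial).
benign = {"SafeAnchor": 85.2, "SafeGrad+LoRA": 67.4,
          "Safety Interleaving": 64.8, "Standard LoRA": 43.6}
adversarial = {"SafeAnchor": 78.4, "SafeGrad+LoRA": 54.6,
               "Safety Interleaving": 49.3, "Standard LoRA": 31.2}


def gap_to_next_best(scores, method="SafeAnchor"):
    """Margin between `method` and the best-scoring other method."""
    best_other = max(v for k, v in scores.items() if k != method)
    return round(scores[method] - best_other, 1)


def ranking(scores):
    return sorted(scores, key=scores.get, reverse=True)


print(gap_to_next_best(benign))       # 17.8
print(gap_to_next_best(adversarial))  # 23.8
print(ranking(benign) == ranking(adversarial))  # True
```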

Capability is dissociable from the safety axis under both standard
and SafeAnchor pipelines. MT-Bench quality: SafeAnchor 6.21 ± 0.15
vs. Standard LoRA 6.08 ± 0.18 — safety preservation does not degrade
conversational quality. WildGuard jailbreak robustness: SafeAnchor
81.3 ± 1.2 vs. Standard LoRA 38.7 ± 3.1 vs. SafeGrad 62.4 ± 2.0.
Domain-task composite is within 1.3 points of unconstrained LoRA
across all SafeAnchor configurations. MMLU is preserved or slightly
improved.

CSM-replay trigger rate as a partial-success signal. CSM triggered
once (after the Code domain) in 2 of 5 seeds at T = 3. The trigger is
the framework's signal that the orthogonal-complement projection alone
was insufficient — that domain-gradient pressure overlaps the safety
subspace enough to need explicit replay. That the trigger fires
specifically after Code in two seeds is consistent with the §4.3 note
that orderings ending with Code (whose gradients overlap most with the
safety subspace) produce marginally lower final safety.
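
The trigger logic behind these counts is simple enough to sketch. The code below implements the threshold test s_t < (1 − τ) s_0 and the "at most one further block" rule from the Method section; the callbacks `evaluate` and `repair_block`, and the refusal-rate readings in the usage example, are hypothetical placeholders rather than anything from the paper.

```python
def csm_triggered(s_t, s_0, tau=0.05):
    """True when the refusal rate has dropped below (1 - tau) * baseline."""
    return s_t < (1.0 - tau) * s_0


def run_csm(evaluate, repair_block, s_0, tau=0.05, max_blocks=2):
    """Run corrective replay until safety recovers or the budget is used.

    evaluate() returns the current probe-set refusal rate and repair_block()
    runs one E_repair block of replay; both are hypothetical callbacks.
    max_blocks=2 encodes one initial block plus at most one extension.
    """
    blocks_used = 0
    s_t = evaluate()
    while csm_triggered(s_t, s_0, tau) and blocks_used < max_blocks:
        repair_block()
        blocks_used += 1
        s_t = evaluate()
    return s_t, blocks_used


# Hypothetical readings: one below threshold, one recovered after replay.
readings = iter([0.90, 0.93])
repairs = []
final, used = run_csm(evaluate=lambda: next(readings),
                      repair_block=lambda: repairs.append("replay-200-steps"),
                      s_0=0.96)
print(final, used)  # 0.93 1
```

With a baseline of 0.96 the threshold is 0.912, so a reading of 0.90 fires the trigger and a post-replay reading of 0.93 clears it after a single block.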

Why it matters

Shallow safety extends from one-shot adversarial to compounding
benign. The wiki's prior framing of shallow-safety has been anchored
on adversarial demonstrations: persona-modulation jailbreak
(Shah et al. 2023) raises GPT-4 harmful-completion rate 185× via
adversarially-crafted system prompts; persona jailbreak
(Sandhan et al. 2026) drives Big-Five trait reversal STIR up to 95.58
via adversarial conversation-history cues. SafeAnchor adds the
deployment-process side: no attacker needed. Three benign domain
adaptations — the routine operation of fine-tuning a chat model for
medicine, then law, then code, with no adversarial input anywhere in
the pipeline — strip 47.8 composite points off Llama-2-7B-Chat's
safety, accelerating, and the pattern holds across all six orderings.
This is the cluster's first measurement of cumulative shallow-safety
collapse under a realistic non-adversarial sequence; the prior
shallow-safety literature it cites (Qi et al. 2025, Yang et al. 2023
Shadow Alignment, Ji et al. 2025 alignment elasticity) measured
single-step degradation.

Parameter-space low-rank corroborates activation-space low-rank.
The wiki's mechanistic-geometry cluster has filed three activation-side
low-rank-safety findings: refusal direction
(single direction across 13 open-source chat models, mean-diff
ablation removes refusal); convergent misalignment direction
(layer-24 mean-diff vector transfers across LoRA setups and datasets);
OpenAI SAE villain persona latent
(single SAE feature mediates EM in GPT-4o). All target the
activation (residual-stream) side. SafeAnchor's Section 4.4 ablation
supplies the parameter-side counterpart: the Fisher Information Matrix
of LoRA parameter gradients on safety data has a sharply-decaying
eigenvalue spectrum — ~8 directions out of the full LoRA parameter
space cover 90% of variance, vs. near-flat on random data. Two
independent angles (residual-stream activations during inference;
LoRA parameter gradients during fine-tuning) now converge on the same
low-rank-safety conclusion. This is the wiki's first parameter-space
low-rank finding, so the working-rhythm pattern of cross-level
corroboration of a mechanistic claim currently rests on one example.

Order-invariance establishes erosion as structural. The
cross-ordering analysis (Appendix B) is the load-bearing robustness
result. If cumulative erosion varied substantially with which domain
came when, the finding would read as "specific domain sequences are
risky." The data argue against that reading: Standard LoRA ends in
[40.5, 44.9] under all six orderings, and SafeAnchor's cross-ordering
SD of 0.51 is about half the within-ordering seed SD (~1.0). The
erosion is structural to unconstrained sequential LoRA adaptation, not
specific to particular domain transitions. This generalises the
shallow-safety claim from which data you fine-tune on to the fact
that you fine-tune at all, repeatedly.

Capability-safety dissociation extends to the parameter-space
mitigation side. The wiki has filed three dissociation findings:
refusal direction (Arditi et al. 2024)
removes refusal without capability loss; Sandhan persona jailbreak
shifts OCEAN trait coordinates while preserving GSM8K / Math / CSQA
within 1–6 points; persona vectors (Chen
et al. 2025) steers traits via activation interventions with
capability preserved. SafeAnchor adds a fourth: domain-task
performance is within 1.5 points of unconstrained fine-tuning under
the orthogonal-complement projection, MMLU is preserved or improved,
and MT-Bench quality is maintained at 6.21 — while safety is held
near baseline. The dissociation shape now has four examples across
removal (refusal direction), reactivation (Sandhan persona jailbreak),
steering (persona vectors), and preservation (SafeAnchor); the safety
axis is dissociable from the capability surface in both directions.

Partial-success residuals foreground bounded mitigation. Per the
schema's intervention-findings guidance, the entry should foreground
what the intervention leaves behind. SafeAnchor's residuals are
specific: (i) 6.8% of original safety still erodes (93.2% retention is
not 100%); (ii) 1.5-point regression on domain task performance; (iii)
the ~2 pts/step slope persists at T = 5, extrapolating to a non-trivial
absolute drop at higher T; (iv) CSM triggered in 2 of 5 seeds after
Code, signalling cases where orthogonal-complement projection alone was
insufficient; (v) the authors flag that longer sequences could
eventually exhaust the orthogonal complement (subspace inflation or
gradient cancellation) — neither observed at T = 5 but unconfirmed at
T ≥ 10. The mechanism shape is downstream-training erosion: the
intervention (safety alignment) does not survive subsequent benign
capability training, and the proposed mitigation slows but does not
arrest the underlying dynamic.

Adversarial-benign gap widens under SafeAnchor — diagnostic, not
just defensive. The +17.8 → +23.8 widening between benign-safety and
adversarial-refusal margins is structurally informative. SafeAnchor
defends a parameter-space subspace; adversarial refusal lives partly in
the activation-space refusal direction (Arditi et al. 2024); the
authors' reading is that preserving the parameter subspace also
preserves the activation direction. If the reading is correct,
parameter-space safety subspace and activation-space refusal direction
are coupled through the LoRA factorisation in a way that pure
parameter-space measurement (Fisher spectrum) does not directly
show; the adversarial-gap widening is the bridge result. This is the
cluster's first measurement consistent with cross-space coupling.

Interpretive tensions

The 100-adversarial-examples-undo-100,000-safety-instances figure
is cited, not measured. The candidate-pool descriptor that motivated
filing this paper (Gemini's "fine-tuning on as few as 100 adversarial
examples can undo 100,000 safety training instances") is from Yang et
al. 2023 Shadow Alignment, not from SafeAnchor itself. SafeAnchor
cites this prior result in the Introduction and Section 2.1
(Fragility of Safety Alignment). Treating this paper as evidence for
the 100/100,000 ratio would be miscrediting; the citation is appropriate
context but the underlying measurement should be sourced separately if
the wiki wants to file that specific claim.

The Fisher-spectrum result is reported in an ablation, not as a
primary measurement. Section 4.4's "Safety Subspace Validation"
appears as a single paragraph confirming the low-rank assumption that
SafeAnchor's framework relies on. The wiki's reading foregrounds the
~8-eigenvectors-cover-90%-variance result as a parameter-space
mechanistic finding; the paper foregrounds it as engineering
validation. Both readings are consistent, but the strength of the
mechanistic claim depends on details (per-layer breakdown, robustness
across calibration sizes; the N_s = 100 result is reassuring but the
spectrum decay rate is reported only in aggregate) that the paper
discusses only in passing. Appendix D's principal-angle and
Grassmannian-distance subspace-stability analyses would strengthen the
mechanistic reading; the wiki has not pulled these in.

Only 7B-scale models tested; alignment elasticity worsens with
scale. Llama-2-7B-Chat and Mistral-7B-Instruct. Ji et al. 2025
(alignment elasticity) — cited in SafeAnchor's motivation — explicitly
finds the elasticity effect intensifies at larger scales. SafeAnchor's
"93.2% retention" headline is anchored at 7B. Whether the
orthogonal-complement projection holds up at 13B, 70B, or frontier
scale is open. The authors flag this as the most important limitation
in their Conclusion's Limitations and Future Work section.

Eight benchmarks but no measure of "the safety overlay is the same
thing across domains." The composite-safety metric averages three
benchmarks (HarmBench, TruthfulQA, BBQ_bias) into a single number. The
Fisher subspace is identified from BeaverTails safety calibration data
(separate from any of the three composite-evaluation benchmarks). The
implicit assumption is that "the safety subspace identified on
BeaverTails" is the same subspace measured by the HarmBench /
TruthfulQA / BBQ evaluation. The paper does not separately decompose
the Fisher subspace by harm-type (instruction-refusal vs. truthfulness
vs. bias-resistance) and does not test whether the ~8 eigenvectors are
the same directions for the three component benchmarks. A
disaggregated analysis would tell us whether "safety" in this paper's
parameter-space sense is one phenomenon or three superimposed ones.

Mitigation framework is the paper's centerpiece; the wiki's reading
foregrounds the empirical measurements. The paper's primary
contribution is engineering (SafeAnchor outperforms baselines for
sequential safety preservation). The wiki's reading foregrounds the
empirical measurements (cumulative erosion under benign sequential
adaptation; parameter-space low-rank Fisher spectrum) and treats the
framework as the methodological context that produced them. A reader
returning from the source paper to this entry should expect the
emphases to differ.

Concepts

Persona selection — adds
parameter-space evidence to the cluster's claim that the
post-training Assistant overlay is a thin, low-dimensional
structure on top of the pretraining persona distribution. The
Fisher-spectrum result (~8 eigenvectors / 90% variance across all
LoRA layers) is the parameter-side counterpart to the activation-
side single-direction refusal finding (Arditi et al. 2024),
the SAE-feature villain-persona latent (OpenAI 2025),
and the mean-diff convergent-misalignment direction (Soligo et
al. 2025). The cumulative-
erosion measurement extends the shallow-safety claim from one-shot
adversarial (the Shah et al.
and Sandhan et al. jailbreak
findings) to compounding benign sequential adaptation.

Cross-references

Refusal direction (Arditi et al. June
2024) — activation-space low-rank-safety finding. SafeAnchor's
Fisher-spectrum result is the parameter-space counterpart on the
same underlying claim; the adversarial-gap widening (+17.8 → +23.8)
is the cluster's first measurement consistent with parameter-space
safety subspace and activation-space refusal direction being
coupled through the LoRA factorisation.
Persona modulation jailbreak
(Shah et al. November 2023) — adversarial counterpart to SafeAnchor's
benign cumulative-erosion measurement. Shah's "Why it matters"
reading invokes the shallow-safety thesis; SafeAnchor confirms the
thesis under a no-attacker deployment pipeline.
Persona jailbreak (Sandhan)
(Sandhan et al. January 2026) — reasoning preservation parallel.
Sandhan: persona reversal preserves GSM8K / Math / CSQA within 1–6
points; SafeAnchor: safety preservation preserves domain task
performance within 1.5 points. The capability surface is
dissociable from both the persona and the safety axes.
Convergent misalignment direction
(Soligo et al. June 2025) — activation-space cross-fine-tune
convergence on a single mean-diff direction. SafeAnchor's
parameter-space subspace identification operates on the same
LoRA-fine-tune object Soligo et al. studied via residual-stream
difference-in-means; the parameter-side low-rank result is
consistent with the activation-side single-direction finding.
Persona vectors (Chen et al. 2025) —
activation-level toolkit for steering trait directions. SafeAnchor's
parameter-space Fisher subspace is the structural analog on the LoRA
side of the same factorisation; the natural follow-up question is
whether the ~8 SafeAnchor eigenvectors project, through the
full-network forward pass, onto Chen et al.'s persona-vector
directions.
Persona Selection Model (Marks,
Lindsey, Olah, Anthropic 2026) — mechanistic account of the
post-training Assistant posterior. The Fisher subspace is the
candidate parameter-space signature of that posterior; whether the
~8 eigenvectors correspond to PSM-style persona simulations or to
something different (refusal circuit, safety classifier, mixed) is
open.
EM Easy (Soligo, Turner, Rajamanoharan,
Nanda February 2026) — companion inductive-bias result. EM Easy
measures that general misalignment is preferred over narrow under
KL-free fine-tuning (efficiency, stability, pre-training
significance); SafeAnchor measures that benign domain fine-tuning
pulls toward the same general misalignment basin by eroding the
safety overlay. Two independent angles on what fine-tuning does to
the post-training Assistant.
Inoculation prompting (Tan et al.
2025) — prompt-level prevention shape. SafeAnchor is the
parameter-level preservation shape: prevention shifts what the
fine-tuning data is interpreted as; preservation projects out the
fine-tuning gradients that would erode safety. The two operate at
different stages of the pipeline but share the load-bearing
question — how to keep the post-training Assistant from drifting
under subsequent training pressure.
Refusal direction (Arditi et al.) —
cited a second time as the GCG-target. The adversarial-gap widening
(+17.8 benign → +23.8 adversarial) is read by the authors as
evidence that preserving the SafeAnchor parameter subspace also
preserves the activation-space refusal direction; this is the
cluster's first measurement consistent with cross-space coupling.

Sources

Guo, Wu, Yiu (2026).
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models.
arXiv:2604.17691.