Summary
Guo, Wu, Yiu — University of Hong Kong, arXiv:2604.17691, v1 April 20, 2026.
Standard LoRA fine-tuning sequentially through three benign domains (Medical → Legal → Code, 5,000 examples × 3 epochs each, LoRA rank 16 on Q/K/V/O) drops Llama-2-7B-Chat composite safety from a baseline of 91.4 to 43.6 ± 2.1, a 47.8-point drop produced without any adversarial input. Mistral-7B-Instruct shows the same pattern, ending at 39.2 ± 2.4. The trajectory is accelerating: 91.4 → 78.3 → 61.5 → 43.6 on Llama-2 (mean 15.9 pts/step). The result holds across all 3! = 6 domain orderings: Standard LoRA's final safety lies in [40.5, 44.9] under every ordering, so the cumulative erosion is intrinsic to unconstrained sequential adaptation, not an artefact of which domain comes last. (The cross-ordering SD of 0.51, below the within-ordering seed SD of ~1.0, is the figure reported for SafeAnchor.) The complementary mechanistic claim: Fisher-information eigendecomposition of LoRA parameter gradients on a 500-example safety calibration set yields a sharply decaying eigenvalue spectrum, with ~8 eigenvectors capturing 90% of variance across all LoRA layers, vs. a near-flat spectrum on random data (Section 4.4 ablation). Safety occupies a genuinely low-rank subspace in LoRA parameter space.
Two contributions land for the wiki. (i) Compounding shallow safety under benign adaptation. The prior shallow-safety line (Qi et al. 2025 on alignment concentrating in the first output tokens; Yang et al. 2023 Shadow Alignment, where 100 adversarial examples undo 100,000 safety-training instances; Ji et al. 2025 on alignment elasticity) addressed single-step fine-tuning, mostly under adversarial input. Guo et al. measure what happens when benign domain adaptations chain across a realistic deployment sequence: ~16 points of safety erode per step on Llama-2, accelerating, regardless of ordering. The shallow-safety thesis extends from a one-shot adversarial property to a compounding property of routine sequential deployment. (ii) Parameter-space low-rank counterpart to the refusal direction. The wiki's mechanistic-geometry cluster (refusal direction, Arditi et al. 2024; persona vectors, Chen et al. 2025; convergent misalignment direction, Soligo et al. 2025; OpenAI SAE villain persona latent) had filed activation-space evidence that safety- and persona-relevant behavior is mediated by low-dimensional residual-stream structure. The Fisher-spectrum result supplies the LoRA-parameter-space analog: the same low-dimensional structure shows up in the parameter side of the LoRA factorization, with ~8 directions capturing 90% variance across all adapted layers. Activation-space and parameter-space low-rank findings now corroborate each other on the same underlying claim about the structure of the post-training Assistant overlay.
Method
Models and sequential pipeline. Llama-2-7B-Chat and Mistral-7B-Instruct, adapted sequentially through Medical (MedQA training split), Legal (LegalBench tasks), and Code (CodeAlpaca subset). Each domain uses 5,000 training examples for 3 epochs. LoRA configuration: rank 16, α = 32, on Q/K/V/O projections. Optimizer AdamW (β₁=0.9, β₂=0.999, weight-decay 0.01); learning rate 2×10⁻⁴ cosine; batch 8 × grad-accum 2. All results: mean ± std over 5 seeds. An alternate ordering (Code → Legal → Medical) and all 3! = 6 permutations are evaluated in §4.3 / Appendix B; a T=5 extension is evaluated in Appendix A.
Eight benchmarks. Domain: MedQA, LegalBench, HumanEval. Safety: HarmBench, TruthfulQA, BBQ, WildGuard. General: MMLU. Composite safety (Eq. 8): ⅓ × (HarmBench/100 + TruthfulQA/100 + (100 − BBQ_bias)/100) × 100.
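The composite in Eq. 8 is a plain average of three normalised components. A minimal sketch of that arithmetic follows; the function name and the example component scores are hypothetical and are not taken from the paper.

```python
def composite_safety(harmbench: float, truthfulqa: float, bbq_bias: float) -> float:
    """Composite safety on a 0-100 scale, following the Eq. 8 form quoted above.

    All three inputs are percentages in [0, 100]. BBQ enters as a bias score,
    so it is inverted (100 - bbq_bias) before averaging.
    """
    for name, value in (("harmbench", harmbench),
                        ("truthfulqa", truthfulqa),
                        ("bbq_bias", bbq_bias)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return (harmbench / 100 + truthfulqa / 100 + (100 - bbq_bias) / 100) / 3 * 100


# Hypothetical component scores, chosen only to exercise the formula.
score = composite_safety(harmbench=90.0, truthfulqa=60.0, bbq_bias=15.0)
print(f"composite safety: {score:.1f}")
```

Because the three terms are weighted equally, a large drop in one component can be partly masked by stability in the other two, which is worth keeping in mind when reading the single composite number.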
Seven baselines, adapted to the sequential setting. Standard LoRA (unconstrained); EWC+LoRA (Fisher-based regularisation); O-LoRA (orthogonal task subspaces per domain); Safe LoRA (post-hoc safety projection after each step); Vaccine+LoRA (pre-immunisation, once before all domains); SafeGrad+LoRA (per-domain gradient surgery); Safety Interleaving (mix 10% BeaverTails into each domain's training set — a natural but previously untested baseline). All use identical LoRA configurations.
SafeAnchor framework. Three components combined for the proposed mitigation:
- Safety Subspace Identification (SSI). Compute the empirical Fisher Information Matrix F_i = ⟨∇_δᵢ log p · ∇_δᵢ log p^⊤⟩ for each LoRA layer's flattened parameter vector δ_i = vec([B_i; A_i]) on a 500-example BeaverTails safety calibration set. Eigendecompose F_i = U_i Λ_i U_i^⊤; select eigenvectors covering cumulative proportion ρ = 90% of total variance as the safety basis V_i^safe; the projector is Π_i^safe = V_i^safe (V_i^safe)^⊤. The subspace is incrementally updated after each domain via SVD merge of old and new bases.
- Orthogonal Safety-Constrained Adaptation (OSCA). Project the task gradient g_i^t = ∇_δᵢ ℒ_t onto the orthogonal complement of the safety subspace: g̃_i^t = g_i^t − Π_i^safe g_i^t. Adaptive relaxation α_i = max(0, 1 − λ · tr(F_i)) strengthens projection on layers with high safety importance (large Fisher trace).
- Cumulative Safety Monitoring (CSM). Evaluate LlamaGuard refusal rate on a 200-example HarmBench probe set after each domain. LlamaGuard achieves 92.1% F1 on the probe set distinguishing safe refusals from harmful completions. If s_t < (1 − τ) s_0 (τ = 0.05, s_0 = baseline refusal rate), trigger E_repair = 200 steps of corrective replay on a mixture of the calibration set and current domain data using OSCA-projected gradients. Replay may be extended by at most one further block. Defaults: ρ=0.90, τ=0.05, γ=0.1 (forward-KL anchor loss), λ=0.5, β=1.0, E_repair=200.
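The SSI and OSCA steps above can be sketched in NumPy on synthetic data. This is an illustration under stated assumptions, not the authors' implementation: the function names and the synthetic gradients are hypothetical, and the way the relaxation coefficient α_i scales the projection (removing a fraction 1 − α_i of the safety component) is an assumption, since the entry gives only the formula for α_i and the fully projected gradient.

```python
import numpy as np


def safety_basis(per_example_grads: np.ndarray, rho: float = 0.90) -> tuple[np.ndarray, float]:
    """SSI sketch: return (V_safe, tr(F)) from per-example gradients of shape (N, d)."""
    n, d = per_example_grads.shape
    fisher = per_example_grads.T @ per_example_grads / n   # empirical Fisher, (d, d)
    eigvals, eigvecs = np.linalg.eigh(fisher)              # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                      # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)           # guard tiny negative round-off
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, rho)) + 1, d)  # smallest k reaching rho
    return eigvecs[:, :k], float(eigvals.sum())


def project_gradient(task_grad: np.ndarray, v_safe: np.ndarray,
                     fisher_trace: float, lam: float = 0.5) -> np.ndarray:
    """OSCA sketch: remove the safety-subspace component of the task gradient."""
    alpha = max(0.0, 1.0 - lam * fisher_trace)             # alpha = 0 means full projection
    safe_component = v_safe @ (v_safe.T @ task_grad)       # Pi_safe @ g
    # ASSUMPTION: alpha relaxes the projection linearly; the source does not spell this out.
    return task_grad - (1.0 - alpha) * safe_component


# Synthetic stand-in for calibration gradients: 3 dominant directions plus small noise.
rng = np.random.default_rng(0)
d, n = 32, 500
directions = np.linalg.qr(rng.normal(size=(d, 3)))[0]      # 3 orthonormal columns
grads = 2.0 * (rng.normal(size=(n, 3)) @ directions.T) + 0.01 * rng.normal(size=(n, d))

v_safe, trace = safety_basis(grads)
g = rng.normal(size=d)                                     # hypothetical task gradient
g_tilde = project_gradient(g, v_safe, trace)
print("basis size:", v_safe.shape[1])
print("residual overlap with safety basis:", np.abs(v_safe.T @ g_tilde).max())
```

For a real LoRA layer the flattened parameter dimension is far too large to form a d × d Fisher matrix directly; one would likely eigendecompose the N × N Gram matrix of the per-example gradients instead. That is a practical observation, not something the entry attributes to the paper.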
Adversarial-refusal evaluation. GCG-style adversarial suffixes (Zou et al. 2023) tested at a compute-reduced configuration (20 optimisation steps, suffix length 20 tokens, 256 attack candidates per step, 100 harmful prompts) that preserves cross-method ordering but yields higher absolute refusal rates than full-budget 500-step GCG.
Key results
Standard LoRA's cumulative erosion is the empirical centerpiece. On Llama-2-7B-Chat, composite safety trajectory across the sequential adaptation:
| Method | Base | +Med | +Legal | +Code | pts/step |
|---|---|---|---|---|---|
| Standard LoRA | 91.4 | 78.3 | 61.5 | 43.6 ± 2.1 | 15.9 |
| SafeGrad+LoRA | 91.4 | 84.1 | 76.2 | 67.4 ± 1.4 | 8.0 |
| Safety Interleaving | 91.4 | — | — | 64.8 ± 1.6 | — |
| SafeAnchor | 91.4 | 89.8 | 87.1 | 85.2 ± 0.9 | 2.1 |
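Standard LoRA's degradation accelerates (13.1, 16.8, then 17.9 points lost per step); SafeGrad roughly halves the mean per-step loss (8.0 vs. 15.9); only SafeAnchor cuts it by nearly an order of magnitude (2.1 pts/step). SafeAnchor retains 93.2 ± 1.0% of original safety. Its margin over the baselines (roughly 18 to 42 points on Llama-2) holds across both models.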
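The per-step and retention figures can be recomputed directly from the table. The short check below uses only the Llama-2 values listed above; the helper names are hypothetical.

```python
BASE = 91.4
STANDARD_LORA_TRAJECTORY = [91.4, 78.3, 61.5, 43.6]   # Base, +Med, +Legal, +Code
FINAL_AFTER_THREE = {
    "Standard LoRA": 43.6,
    "SafeGrad+LoRA": 67.4,
    "SafeAnchor": 85.2,
}


def per_step_loss(base: float, final: float, steps: int) -> float:
    """Mean composite-safety points lost per adaptation step."""
    return (base - final) / steps


def retention_pct(base: float, final: float) -> float:
    """Final safety as a percentage of the baseline."""
    return 100.0 * final / base


# Step-by-step losses for Standard LoRA: each one is larger than the last.
increments = [round(a - b, 1)
              for a, b in zip(STANDARD_LORA_TRAJECTORY, STANDARD_LORA_TRAJECTORY[1:])]
slopes = {name: round(per_step_loss(BASE, final, 3), 1)
          for name, final in FINAL_AFTER_THREE.items()}
safeanchor_retention = round(retention_pct(BASE, FINAL_AFTER_THREE["SafeAnchor"]), 1)

print("Standard LoRA per-step losses:", increments)
print("mean pts/step:", slopes)
print("SafeAnchor retention (%):", safeanchor_retention)
```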
Mistral-7B-Instruct replicates the pattern. Final composite safety after the same Medical → Legal → Code pipeline: Standard LoRA 39.2 ± 2.4; SafeGrad 60.0+; SafeAnchor 82.6 ± 1.0 (93.1% of original).
Ordering invariance: erosion is intrinsic, not artefactual. Across all 3! = 6 domain orderings (Llama-2, Appendix B): Standard LoRA final safety in [40.5, 44.9]; SafeAnchor in [83.9, 85.2] (mean 84.55, cross-ordering SD 0.51). Cross-ordering SD is approximately half the within-ordering seed SD (~1.0) — ordering explains less variance than random seeds. Cumulative erosion is therefore a property of unconstrained sequential adaptation regardless of which domain comes when, not an artefact of any particular sequence.
T = 5 extension: the per-step slope holds. Extended to five sequential domains (Appendix A): SafeAnchor final safety 81.6 ± 1.1 (91.4 → 89.8 → 87.1 → 85.2 → 83.4 → 81.6), per-step slope 1.96 vs. 2.1 at T = 3. Standard LoRA decelerates from 15.9 to 13.3 pts/step only because it approaches the residual-refusal floor (~20 for HarmBench on Llama-2-Chat). SafeGrad+LoRA at T = 5 drops to 54.3 ± 1.5 — SafeAnchor holds a +27.3 margin.
Fisher spectrum: safety occupies a sharply low-rank subspace (Section 4.4 ablation). Across all LoRA layers, the Fisher eigenvalue spectrum decays sharply: on average ~8 eigenvectors capture 90% of total variance. A random subset of training data produces a near-flat spectrum, confirming the structure is not an artefact of Fisher estimation. The result is consistent with activation-space findings on single-direction refusal (Arditi et al. 2024) and broader representation-engineering work (Zou et al. 2023). Calibration-size sensitivity: even N_s = 100 yields 83.1 final safety (still above all baselines), confirming stable Fisher eigenvector estimation.
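To make the contrast between a sharply decaying and a near-flat spectrum concrete, the sketch below counts how many eigenvalues are needed to reach 90% of total variance. Both spectra are synthetic stand-ins (a geometric decay and a constant), chosen only for illustration; they are not the paper's measured spectra, and the function name is hypothetical.

```python
import numpy as np


def components_for_variance(eigenvalues, rho: float = 0.90) -> int:
    """Smallest number of leading eigenvalues whose sum reaches rho of the total."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending
    cumulative = np.cumsum(vals) / vals.sum()
    return int(min(np.searchsorted(cumulative, rho) + 1, vals.size))


d = 256
decaying = 0.7 ** np.arange(d)   # synthetic sharply decaying spectrum
flat = np.ones(d)                # synthetic flat spectrum

k_decaying = components_for_variance(decaying)
k_flat = components_for_variance(flat)
print("decaying spectrum:", k_decaying, "components for 90% variance")
print("flat spectrum:    ", k_flat, "components for 90% variance")
```

With these synthetic inputs the decaying spectrum needs 7 components and the flat one needs 231 of 256, which is the qualitative gap the ablation relies on.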
Adversarial-refusal gap widens vs. benign. Under GCG-style suffix attacks: SafeAnchor maintains 78.4 ± 2.1% refusal vs. SafeGrad+LoRA 54.6 ± 2.6%, Safety Interleaving 49.3 ± 2.9%, Standard LoRA 31.2 ± 3.8%. The SafeAnchor-to-next-best gap widens from +17.8 points (benign-safety) to +23.8 points (adversarial-refusal); Spearman ρ = 0.96 between attack-ranking and benign-safety ranking. The authors read this as evidence that preserving the safety subspace also preserves the adversarial refusal direction (Arditi et al. 2024) that GCG most effectively targets.
Capability is dissociable from the safety axis under both standard and SafeAnchor pipelines. MT-Bench quality: SafeAnchor 6.21 ± 0.15 vs. Standard LoRA 6.08 ± 0.18, so safety preservation does not degrade conversational quality. WildGuard jailbreak robustness: SafeAnchor 81.3 ± 1.2 vs. Standard LoRA 38.7 ± 3.1 vs. SafeGrad 62.4 ± 2.0. Domain-task composite is within 1.5 points of unconstrained LoRA across all SafeAnchor configurations. MMLU is preserved or slightly improved.
CSM-replay trigger rate as a partial-success signal. CSM triggered once (after the Code domain) in 2 of 5 seeds at T = 3. The trigger is the framework's signal that the orthogonal-complement projection alone was insufficient — that domain-gradient pressure overlaps the safety subspace enough to need explicit replay. That the trigger fires specifically after Code in two seeds is consistent with the §4.3 note that orderings ending with Code (whose gradients overlap most with the safety subspace) produce marginally lower final safety.
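The trigger rule itself is a one-line threshold test with a bounded repair loop. A minimal sketch follows, assuming hypothetical `measure_refusal` and `replay` callbacks that stand in for the LlamaGuard probe and the corrective-replay step; the refusal rates in the example are made up.

```python
from typing import Callable


def needs_repair(s_t: float, s_0: float, tau: float = 0.05) -> bool:
    """CSM trigger: fire when refusal rate falls below (1 - tau) of the baseline."""
    return s_t < (1.0 - tau) * s_0


def monitor_and_repair(measure_refusal: Callable[[], float],
                       replay: Callable[[int], None],
                       s_0: float,
                       tau: float = 0.05,
                       e_repair: int = 200,
                       max_blocks: int = 2) -> int:
    """Run at most max_blocks replay blocks (one block plus one extension).

    Returns the total number of corrective-replay steps spent.
    """
    steps = 0
    for _ in range(max_blocks):
        if not needs_repair(measure_refusal(), s_0, tau):
            break
        replay(e_repair)          # hypothetical: OSCA-projected corrective replay
        steps += e_repair
    return steps


# Hypothetical probe readings: below threshold once, recovered after one block.
readings = iter([0.88, 0.93])
steps_used = monitor_and_repair(lambda: next(readings), lambda n: None, s_0=0.95)
print("replay steps used:", steps_used)
```

With s_0 = 0.95 and τ = 0.05 the threshold is 0.9025, so the first reading triggers one 200-step block and the second reading ends the loop.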
Why it matters
Shallow safety extends from one-shot adversarial to compounding benign. The wiki's prior framing of shallow-safety has been anchored on adversarial demonstrations: persona-modulation jailbreak (Shah et al. 2023) raises GPT-4 harmful-completion rate 185× via adversarially-crafted system prompts; persona jailbreak (Sandhan et al. 2026) drives Big-Five trait reversal STIR up to 95.58 via adversarial conversation-history cues. SafeAnchor adds the deployment-process side: no attacker needed. Three benign domain adaptations — the routine operation of fine-tuning a chat model for medicine, then law, then code, with no adversarial input anywhere in the pipeline — strip 47.8 composite points off Llama-2-7B-Chat's safety, accelerating, and the pattern holds across all six orderings. This is the cluster's first measurement of cumulative shallow-safety collapse under a realistic non-adversarial sequence; the prior shallow-safety literature it cites (Qi et al. 2025, Yang et al. 2023 Shadow Alignment, Ji et al. 2025 alignment elasticity) measured single-step degradation.
Parameter-space low-rank corroborates activation-space low-rank. The wiki's mechanistic-geometry cluster has filed three activation-side low-rank-safety findings: refusal direction (single direction across 13 open-source chat models, mean-diff ablation removes refusal); convergent misalignment direction (layer-24 mean-diff vector transfers across LoRA setups and datasets); OpenAI SAE villain persona latent (single SAE feature mediates EM in GPT-4o). All target the activation (residual-stream) side. SafeAnchor's Section 4.4 ablation supplies the parameter-side counterpart: the Fisher Information Matrix of LoRA parameter gradients on safety data has a sharply-decaying eigenvalue spectrum — ~8 directions out of the full LoRA parameter space cover 90% of variance, vs. near-flat on random data. Two independent angles (residual-stream activations during inference; LoRA parameter gradients during fine-tuning) now converge on the same low-rank-safety conclusion. This is the wiki's first parameter-space low-rank finding; the working-rhythm pattern (cross-level corroboration of a mechanistic claim) is at one example.
Order-invariance establishes erosion as structural. The cross-ordering analysis (Appendix B) is the load-bearing robustness result. If cumulative erosion varied substantially with which domain came when, the finding would read as "specific domain sequences are risky." Standard LoRA's final safety stays within [40.5, 44.9] under all six orderings, and SafeAnchor's cross-ordering SD of 0.51 sits below the within-ordering seed SD of ~1.0. Both results refute that reading: the erosion is structural to unconstrained sequential LoRA adaptation, not specific to particular domain transitions. This generalises the shallow-safety claim from what data you fine-tune on to the fact that you fine-tune at all, repeatedly.
Capability-safety dissociation extends to the parameter-space mitigation side. The wiki has filed three dissociation findings: refusal direction (Arditi et al. 2024) removes refusal without capability loss; Sandhan persona jailbreak shifts OCEAN trait coordinates while preserving GSM8K / Math / CSQA within 1–6 points; persona vectors (Chen et al. 2025) steers traits via activation interventions with capability preserved. SafeAnchor adds a fourth: domain-task performance is within 1.5 points of unconstrained fine-tuning under the orthogonal-complement projection, MMLU is preserved or improved, and MT-Bench quality is maintained at 6.21 — while safety is held near baseline. The dissociation shape now has four examples across removal (refusal direction), reactivation (Sandhan persona jailbreak), steering (persona vectors), and preservation (SafeAnchor); the safety axis is dissociable from the capability surface in both directions.
Partial-success residuals foreground bounded mitigation. Per the schema's intervention-findings guidance, the entry should foreground what the intervention leaves behind. SafeAnchor's residuals are specific: (i) 6.8% of original safety still erodes (93.2% retention is not 100%); (ii) 1.5-point regression on domain task performance; (iii) the ~2 pts/step slope persists at T = 5, extrapolating to a non-trivial absolute drop at higher T; (iv) CSM triggered in 2 of 5 seeds after Code, signalling cases where orthogonal-complement projection alone was insufficient; (v) the authors flag that longer sequences could eventually exhaust the orthogonal complement (subspace inflation or gradient cancellation) — neither observed at T = 5 but unconfirmed at T ≥ 10. The mechanism shape is downstream-training erosion: the intervention (safety alignment) does not survive subsequent benign capability training, and the proposed mitigation slows but does not arrest the underlying dynamic.
Adversarial-benign gap widens under SafeAnchor — diagnostic, not just defensive. The +17.8 → +23.8 widening between benign-safety and adversarial-refusal margins is structurally informative. SafeAnchor defends a parameter-space subspace; adversarial refusal lives partly in the activation-space refusal direction (Arditi et al. 2024); the authors' reading is that preserving the parameter subspace also preserves the activation direction. If the reading is correct, parameter-space safety subspace and activation-space refusal direction are coupled through the LoRA factorisation in a way that pure parameter-space measurement (Fisher spectrum) does not directly show — the adversarial-gap widening is the bridge result. This is one of the cluster's first cross-space coupling measurements.
Interpretive tensions
The 100-adversarial-examples-undo-100,000-safety-instances figure is cited, not measured. The candidate-pool descriptor that motivated filing this paper (Gemini's "fine-tuning on as few as 100 adversarial examples can undo 100,000 safety training instances") is from Yang et al. 2023 Shadow Alignment, not from SafeAnchor itself. SafeAnchor cites this prior result in the Introduction and Section 2.1 (Fragility of Safety Alignment). Treating this paper as evidence for the 100/100,000 ratio would be miscrediting; the citation is appropriate context but the underlying measurement should be sourced separately if the wiki wants to file that specific claim.
The Fisher-spectrum result is reported in an ablation, not as a primary measurement. Section 4.4's "Safety Subspace Validation" appears as a single paragraph confirming the low-rank assumption that SafeAnchor's framework relies on. The wiki's reading foregrounds the ~8-eigenvectors-cover-90%-variance result as a parameter-space mechanistic finding; the paper foregrounds it as engineering validation. Both readings are consistent, but the strength of the mechanistic claim depends on details (per-layer breakdown, robustness across calibration sizes — the Ns=100 result is reassuring but the spectrum decay rate is reported only in aggregate) that the paper discusses only in passing. Appendix D's principal-angle and Grassmannian-distance subspace-stability analyses would strengthen the mechanistic reading; the wiki has not pulled these in.
Only 7B-scale models tested; alignment elasticity worsens with scale. Llama-2-7B-Chat and Mistral-7B-Instruct. Ji et al. 2025 (alignment elasticity) — cited in SafeAnchor's motivation — explicitly finds the elasticity effect intensifies at larger scales. SafeAnchor's "93.2% retention" headline is anchored at 7B. Whether the orthogonal-complement projection holds up at 13B, 70B, or frontier scale is open. The authors flag this as the most important limitation in their Conclusion's Limitations and Future Work section.
Eight benchmarks but no measure of "the safety overlay is the same thing across domains." The composite-safety metric averages three benchmarks (HarmBench, TruthfulQA, BBQ_bias) into a single number. The Fisher subspace is identified from BeaverTails safety calibration data (separate from any of the three composite-evaluation benchmarks). The implicit assumption is that "the safety subspace identified on BeaverTails" is the same subspace measured by the HarmBench / TruthfulQA / BBQ evaluation. The paper does not separately decompose the Fisher subspace by harm-type (instruction-refusal vs. truthfulness vs. bias-resistance) and does not test whether the ~8 eigenvectors are the same directions for the three component benchmarks. A disaggregated analysis would tell us whether "safety" in this paper's parameter-space sense is one phenomenon or three superimposed ones.
Mitigation framework is the paper's centerpiece; the wiki's reading foregrounds the empirical measurements. The paper's primary contribution is engineering (SafeAnchor outperforms baselines for sequential safety preservation). The wiki's reading foregrounds the empirical measurements (cumulative erosion under benign sequential adaptation; parameter-space low-rank Fisher spectrum) and treats the framework as the methodological context that produced them. A reader returning from the source paper to this entry should expect the emphases to differ.
Concepts
- Persona selection: adds parameter-space evidence to the cluster's claim that the post-training Assistant overlay is a thin, low-dimensional structure on top of the pretraining persona distribution. The Fisher-spectrum result (~8 eigenvectors / 90% variance across all LoRA layers) is the parameter-side counterpart to the activation-side single-direction refusal finding (Arditi et al. 2024), the SAE-feature villain-persona latent (OpenAI 2025), and the mean-diff convergent-misalignment direction (Soligo et al. 2025). The cumulative-erosion measurement extends the shallow-safety claim from one-shot adversarial (the Shah et al. and Sandhan et al. jailbreak findings) to compounding benign sequential adaptation.
Cross-references
- Refusal direction (Arditi et al. June 2024) — activation-space low-rank-safety finding. SafeAnchor's Fisher-spectrum result is the parameter-space counterpart on the same underlying claim; the adversarial-gap widening (+17.8 → +23.8) is the cluster's first measurement consistent with parameter-space safety subspace and activation-space refusal direction being coupled through the LoRA factorisation.
- Persona modulation jailbreak (Shah et al. November 2023) — adversarial counterpart to SafeAnchor's benign cumulative-erosion measurement. Shah's "Why it matters" reading invokes the shallow-safety thesis; SafeAnchor confirms the thesis under a no-attacker deployment pipeline.
- Persona jailbreak (Sandhan) (Sandhan et al. January 2026) — reasoning preservation parallel. Sandhan: persona reversal preserves GSM8K / Math / CSQA within 1–6 points; SafeAnchor: safety preservation preserves domain task performance within 1.5 points. The capability surface is dissociable from both the persona and the safety axes.
- Convergent misalignment direction (Soligo et al. June 2025) — activation-space cross-fine-tune convergence on a single mean-diff direction. SafeAnchor's parameter-space subspace identification operates on the same LoRA-fine-tune object Soligo et al. studied via residual-stream difference-in-means; the parameter-side low-rank result is consistent with the activation-side single-direction finding.
- Persona vectors (Chen et al. 2025) — activation-level toolkit for steering trait directions. SafeAnchor's parameter-space Fisher subspace is the structural analog on the LoRA side of the same factorisation; the natural follow-up question is whether the ~8 SafeAnchor eigenvectors project, through the full-network forward pass, onto Chen et al.'s persona-vector directions.
- Persona Selection Model (Marks, Lindsey, Olah, Anthropic 2026) — mechanistic account of the post-training Assistant posterior. The Fisher subspace is the candidate parameter-space signature of that posterior; whether the ~8 eigenvectors correspond to PSM-style persona simulations or to something different (refusal circuit, safety classifier, mixed) is open.
- EM Easy (Soligo, Turner, Rajamanoharan, Nanda February 2026) — companion inductive-bias result. EM Easy measures that general misalignment is preferred over narrow under KL-free fine-tuning (efficiency, stability, pre-training significance); SafeAnchor measures that benign domain fine-tuning pulls toward the same general misalignment basin by eroding the safety overlay. Two independent angles on what fine-tuning does to the post-training Assistant.
- Inoculation prompting (Tan et al. 2025) — prompt-level prevention shape. SafeAnchor is the parameter-level preservation shape: prevention shifts what the fine-tuning data is interpreted as; preservation projects out the fine-tuning gradients that would erode safety. The two operate at different stages of the pipeline but share the load-bearing question — how to keep the post-training Assistant from drifting under subsequent training pressure.
- Refusal direction (Arditi et al.) — cited a second time as the GCG-target. The adversarial-gap widening (+17.8 benign → +23.8 adversarial) is read by the authors as evidence that preserving the SafeAnchor parameter subspace also preserves the activation-space refusal direction; this is the cluster's first cross-space coupling measurement.
Sources
- Guo, Wu, Yiu (2026). SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models. arXiv:2604.17691.