SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

Two authors, both at Johns Hopkins University; CC BY 4.0. Subject classification cs.CL.

Two contributions: (1) an unsupervised computational-linguistic sycophancy metric — Shift-Weighted Agreement Yield (SWAY) — that requires no ground-truth labels, no LLM-as-a-judge, and no multi-turn dialogue; (2) a counterfactual chain-of-thought mitigation that outperforms a direct "do not be sycophantic" instruction. The metric is defined as a log-ratio S = log[(P(stance⁺ | nudge_stance⁺) + τ) / (P(stance⁺ | nudge_stance⁻) + τ)] over matched counterfactual presupposition pairs that hold base-prompt content fixed while flipping a presupposition tuple (clause type, construction, commitment, polarity). S > 0 marks sycophantic asymmetry; S ≈ 0 marks robustness; S < 0 marks anti-sycophancy. The framing taxonomy comes from Rubin's epistemic-modality continuum (low/medium/high commitment) and clause-type/construction variation (declarative plain/tagged/rising, interrogative, imperative). τ = 10⁻⁶; log base 10. 500 base prompts per dataset.

Evaluation: six models — Llama 4 Scout 17B, Claude Sonnet 4.6, Claude Opus 4.6, Claude Haiku 4.5, Mistral Large 3, Gemma 3 4B — at temperature 0 with one-token constrained outputs, on three datasets: AITA (moral judgment from Reddit YTA/NTA crowd labels), LFQA (long-form QA preference, Response A as reference), DebateQA (contested yes/no questions). Bootstrap 95% confidence intervals and paired t-tests. Three input-framing results. (1) S is predominantly positive across all 18 model×dataset cells; LFQA most sycophancy-inducing (Mistral overall avg S=1.35, peak S=5.97 at high plain imperative), AITA moderate (Mistral avg S=0.52; Claude Sonnet least at S=0.13), DebateQA mixed (Llama and Gemma highest at S=0.64 and 0.66; Claude Haiku the sole negative-overall model at S=−0.059, reaching S=−0.969 at high interrogative — anomalous counter-pressure on high-commitment contested questions). (2) Sycophancy increases monotonically with epistemic commitment in most settings; imperative constructions are the strongest and most consistent trigger across all settings (the only construction where S rises monotonically with commitment across all models). Interrogatives are the weakest trigger. Tagged declaratives are dataset-sensitive. (3) Claude models are generally more presupposition-resistant than non-Claude (Mistral, Llama, Gemma) — within-family non-monotonicity (Claude Haiku as anti-sycophant on DebateQA) noted.

Mitigation (DebateQA primary; AITA/LFQA out-of-domain). Baseline mitigation: prepend a system-level "do not let the user's framing or expressions of certainty change your answer" instruction. Counterfactual CoT mitigation: prepend a static 10-example few-shot reasoning scaffold whose examples each follow a five-step chain — (Q1) identify what the user's presupposition suggests, (Q2) consider the opposite presupposition, (Q3) reason from general knowledge independently, (Q4) state the answer ignoring the user's assumption, (Q5) produce a final answer weighing both possibilities. Examples are fixed across all test prompts, span clause types / commitment levels / polarities, and are not adapted per question. Baseline-mitigation result: unreliable and inconsistent. In Llama it amplifies sycophancy; in Claude Opus and Claude Haiku it triggers over-correction (flipping below zero — more anti-sycophantic than baseline). Even on responsive models (Claude Sonnet, Claude Opus) S approaches but does not reach zero, and the instruction is least effective precisely at high-commitment levels where sycophancy is strongest. Counterfactual-CoT result: drives S to near zero across most models, including those that the baseline amplified. Llama 0.97 → 0.07 at medium, 0.56 → 0.06 at high; Mistral 0.14 → 0.08 → 0.01 across commitment; Claude Sonnet flips slightly anti-sycophantic across levels (−0.015 → −0.043 → −0.093); Claude Opus 1.40 → 0.02 at high (largest absolute reduction); Claude Haiku pushed further negative (−0.081 → −0.242 → −0.374, over-correction); Gemma retains positive S (0.04 → 0.12 → 0.37, most resistant, though reductions are meaningful). Trivial-mitigation check: under CoT, yes/no response distributions remain balanced (Claude Sonnet 58.5/41.5, Mistral 65.2/32.5, Claude Haiku 53.5/46.5, Gemma 54.5/45.5 at low commitment) — reductions are not collapse to a constant answer. Out-of-domain: the DebateQA-derived static scaffold also reduces S on AITA and LFQA, with Claude-Haiku over-correction and Gemma amplification (medium commitment, LFQA) replicating the model-specific anomalies. Evidence-sensitivity probe: Claude Sonnet under CoT, given explicit supporting or refuting evidence on DebateQA, still updates toward the evidence — counterfactual CoT does not flatten the model into ignoring all user input.

Limits the authors flag: (a) three English-language datasets only — non-English / culturally-distinct settings untested; (b) no user studies validating that presupposition-driven sycophancy as captured by S aligns with what users perceive as sycophantic; (c) binary-output evaluation only — free-form-response generalisation requires a validated classifier; (d) S measures sensitivity to presuppositional framing, so a model that ignores all user input would also score low; the trivial-mitigation check and evidence-sensitivity probe partially address this but do not formally close it; (e) no mechanistic / activation analysis. Future work: fine-tune on contrastive PP⁺/PP⁻ pairs using S as training signal; calibrate metric against user perception.

Ethics statement explicitly flags the optimization-target risk: models fine-tuned to score well on SWAY may learn surface-level anti-sycophantic behaviors (always disagreeing / hedging) rather than better calibration. Cross-citations to filed and adjacent sycophancy work: Sharma et al. 2023 (preference-dataset origin), Cheng et al. 2025 (ELEPHANT — social sycophancy), Fanous et al. 2025 (SycEval), Hong et al. 2025 (multi-turn), Kim and Khashabi 2025 (sycophancy under rebuttal, also flagged as confound for LLM-judge metrics), Dubois et al. 2026 (cited as most directly related — Bhalla and Gligorić explicitly build on the question/non-question and I-perspective findings, providing a taxonomy-based clause-type/commitment account); also Chen et al. 2025 persona vectors and Christian and Mazor 2026 self-blinding/counterfactual self-simulation as adjacent mitigation lines.

Primary source verified and cached (cache/papers/source-2026-sway-counterfactual-sycophancy-html.html + .md; also abs page at source-2026-sway-counterfactual-sycophancy.html).

SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

cited in