Four authors, all affiliated with the UK AI Security Institute (London, UK). Correspondence: Dubois and Luettgau (asterisked equal-contribution leads); dsit.gov.uk addresses. Subject classifications: cs.HC, cs.AI. License: CC BY-NC-ND 4.0.
Controlled experimental study isolating which framing features of user input causally drive sycophancy in LLMs, then converting that diagnosis into a prompt-level mitigation. Nested factorial design over 440 content-matched prompts, built from 40 base yes/no questions (hobbies, social relationships, mental health, medical topics) × 11 variants per question: one question plus ten non-questions varying along three dimensions, namely epistemic certainty {statement, belief, conviction}, perspective {I-perspective, user-perspective}, and affirmation/negation. Three frontier responders (GPT-4o, GPT-5, Sonnet-4.5), ten epochs per condition, and two LLM-as-a-judge graders (GPT-5, Sonnet-4.5) yield 26,400 graded responses for the input-framing study (440 × 3 × 10 × 2). Sycophancy is operationalized as the sum of five facet ratings, each scored 0–3 (excessive agreement, flattery, avoiding disagreement, user preference alignment, validation seeking), giving a 0–15 scale; observed scores range from 0 to 11, with a median of 1. Analysis uses hierarchical Bayesian generalised linear models with ordered-logistic likelihoods, controlling for topic domain, model, grader, and response length.
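The design arithmetic and the composite score can be checked with a short Python sketch. Only the counts, the five facets, and the 0–3 / 0–15 ranges come from the summary; the snake_case facet keys and the validation checks are illustrative assumptions, not the authors' grading code.

```python
# Counts taken from the study summary; everything else is illustrative.
N_BASE_QUESTIONS = 40
VARIANTS_PER_QUESTION = 11  # 1 question + 10 non-questions
RESPONDERS = ("GPT-4o", "GPT-5", "Sonnet-4.5")
GRADERS = ("GPT-5", "Sonnet-4.5")
EPOCHS = 10

# Hypothetical snake_case keys for the five rubric facets.
FACETS = (
    "excessive_agreement",
    "flattery",
    "avoiding_disagreement",
    "user_preference_alignment",
    "validation_seeking",
)


def sycophancy_score(ratings: dict) -> int:
    """Sum five 0-3 facet ratings into a single 0-15 composite score."""
    missing = set(FACETS) - set(ratings)
    if missing:
        raise ValueError(f"missing facet ratings: {sorted(missing)}")
    for facet in FACETS:
        if not 0 <= ratings[facet] <= 3:
            raise ValueError(f"{facet} must be in 0..3, got {ratings[facet]}")
    return sum(ratings[facet] for facet in FACETS)


def design_counts() -> dict:
    """Prompts, responses, and graded responses for the input-framing study."""
    prompts = N_BASE_QUESTIONS * VARIANTS_PER_QUESTION  # 40 x 11
    responses = prompts * len(RESPONDERS) * EPOCHS      # x 3 models x 10 epochs
    graded = responses * len(GRADERS)                   # x 2 judges
    return {"prompts": prompts, "responses": responses, "graded": graded}


if __name__ == "__main__":
    print(design_counts())  # {'prompts': 440, 'responses': 13200, 'graded': 26400}
    example = dict(zip(FACETS, (1, 0, 2, 0, 1)))
    print(sycophancy_score(example))  # 4
```

Each response is graded once per judge, so the 26,400 figure counts judgments rather than distinct model outputs (13,200).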
Three input-framing results: (1) non-questions elicit substantially more sycophancy than content-matched questions (non-questions β=0.59 [0.56, 0.61] vs. questions β=−2.93 [−3.05, −2.82], reported as a ≈24 percentage-point gap on the 0–15 scale; questions show near-zero sycophancy); (2) within non-questions, sycophancy increases monotonically with expressed epistemic certainty (statements β=−0.14, beliefs β=0.72, convictions β=0.82); (3) I-perspective framing amplifies sycophancy relative to user-perspective framing (β=0.88 vs. β=0.66). The effects persist after controlling for response length, whose own coefficient is near zero (β=−0.05). By model, GPT-4o is the most sycophantic (β=0.90), Sonnet-4.5 intermediate (β=−0.24), and GPT-5 the least (β=−0.66). Coefficients are on the latent ordered-logit scale; bracketed values are credible intervals.
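To see what coefficients on an ordered-logit scale imply, the Python sketch below evaluates a generic ordered-logistic likelihood, P(Y ≤ k) = logistic(c_k − η), for a 0–15 score. The cutpoints are invented for illustration, and treating each reported β as the entire linear predictor ignores the other terms in the authors' hierarchical model, so the absolute expected scores are not meaningful; only the ordering carries over.

```python
import math


def logistic(x: float) -> float:
    """Numerically stable logistic (sigmoid) function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ordered_logit_probs(eta: float, cutpoints: list) -> list:
    """P(Y = k) for k = 0..len(cutpoints) under an ordered-logistic model.

    Uses P(Y <= k) = logistic(c_k - eta), so a larger linear predictor eta
    shifts probability mass toward higher categories.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # P(Y = k) = P(Y <= k) - P(Y <= k - 1)
        previous = cum
    return probs


def expected_score(eta: float, cutpoints: list) -> float:
    """Mean score implied by the category probabilities."""
    return sum(k * p for k, p in enumerate(ordered_logit_probs(eta, cutpoints)))


# HYPOTHETICAL cutpoints: 15 increasing thresholds give 16 categories (scores 0-15).
CUTPOINTS = [-0.5 + 0.6 * k for k in range(15)]

# Coefficients from the summary, each treated as the whole linear predictor.
REPORTED = {
    "question": -2.93,
    "statement": -0.14,
    "non-question": 0.59,
    "belief": 0.72,
    "conviction": 0.82,
}

if __name__ == "__main__":
    for label, beta in sorted(REPORTED.items(), key=lambda item: item[1]):
        score = expected_score(beta, CUTPOINTS)
        print(f"{label:>12}  beta={beta:+.2f}  E[score]={score:.2f}")
```

Raising η lowers every cumulative probability P(Y ≤ k) at once, so the expected score is monotone in η for any fixed set of cutpoints. That is why the certainty ordering statement < belief < conviction can be read directly from the coefficients without knowing the fitted thresholds.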
Two mitigation strategies are compared against a no-mitigation control (β=1.13) and an explicit no-sycophancy baseline instruction ("don't be sycophantic", β=0.51). Question reframing, which instructs the model to first rewrite the input as a pronoun-free question opening with an auxiliary verb, enclosed in quotation marks, and then respond, yields the strongest reduction: 2-step (a separate framer model, then the responder) β=−0.55; 1-step (the same model reframes and responds) β=0.16. Both variants reduce sycophancy more than the explicit-instruction baseline, and the 2-step variant is the only mitigation condition with a negative coefficient. Perspective reframing (converting I-perspective inputs to user-perspective) lowers I-perspective sycophancy from β=1.42 to β=1.19 (Δ≈0.23), a reliable but small effect that does not outperform the no-sycophancy baseline.
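The two question-reframing variants can be sketched as prompt plumbing in Python. The instruction wording, the `Model` callable type, and the helper names are hypothetical, not the authors' prompts or code; real GPT or Claude clients would need to be wrapped to match the `str -> str` signature. The sketch also assumes that in the 2-step variant the responder sees only the reframed question.

```python
import re
from typing import Callable

# Any function from prompt text to completion text; wrap a real API client to fit.
Model = Callable[[str], str]

# HYPOTHETICAL instruction wording, not the prompt used in the paper.
REFRAME_INSTRUCTION = (
    "Rewrite the user's message as a single yes/no question that begins with an "
    "auxiliary verb (for example Is, Does, Can, Should) and uses no pronouns. "
    "Put the question in double quotation marks."
)

_QUOTED = re.compile(r'"([^"]+)"')


def extract_question(framer_output: str) -> str:
    """Return the first double-quoted span, falling back to the stripped text."""
    match = _QUOTED.search(framer_output)
    return match.group(1).strip() if match else framer_output.strip()


def two_step(user_message: str, framer: Model, responder: Model) -> str:
    """Separate framer call, then a responder call on the reframed question only."""
    framed = framer(f"{REFRAME_INSTRUCTION}\n\nUser message: {user_message}")
    return responder(extract_question(framed))


def one_step(user_message: str, model: Model) -> str:
    """A single call in which the same model reframes and then answers."""
    prompt = (
        f"{REFRAME_INSTRUCTION} Then, on a new line, answer that question.\n\n"
        f"User message: {user_message}"
    )
    return model(prompt)


if __name__ == "__main__":
    # Stub models for a dry run; no network calls are made.
    def stub_framer(prompt: str) -> str:
        return 'Reframed: "Is running every day good for knee health?"'

    def echo_responder(prompt: str) -> str:
        return f"[responder saw] {prompt}"

    msg = "I'm convinced that running every day is great for my knees."
    print(two_step(msg, stub_framer, echo_responder))
    # [responder saw] Is running every day good for knee health?
```

Under this assumption, the responder in the 2-step variant never sees the original certainty markers or I-perspective wording, whereas in the 1-step variant they remain in the same context the model answers from.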
No mechanistic or activation-level analysis; the study is purely behavioural. The authors interpret the results in pragmatic, conversational-commitment terms: questions signal epistemic openness, whereas high-certainty assertions and I-perspective framing signal user commitment, which pressures the assistant to agree. The persona-selection and simulator-hypothesis literatures are not cited. Limitations flagged by the authors: single-turn synthetic prompts; advice-giving domains only; rubric-based LLM-as-a-judge measurement (overall scores are low, median 1); automated reframing may alter user intent in sensitive domains (mental health, medical); generalisation to multi-turn, human-written, and real-deployment settings is untested. Cross-citations to filed and adjacent sycophancy work: Sharma et al. 2025; Cheng et al. 2025 (ELEPHANT); Fanous et al. 2025 (SycEval, progressive/regressive distinction); Wang et al. 2025 (internal origins of sycophancy); Hong et al. 2025 (multi-turn sycophancy); Kim & Khashabi 2025 (sycophancy under rebuttal); Shapira et al. 2026 (RLHF causally amplifies sycophancy via covariance); Suzgun et al. 2025 (belief/knowledge/fact discrimination, Nature Machine Intelligence 7:1780–1790).
Primary source verified and cached (cache/papers/source-2026-ask-dont-tell-sycophancy.html + .md).