Findings
Time-stamped empirical results tied to specific sources, models, and dates. Findings cluster under concepts, so the material can be entered from either a finding or a concept.
2026
- Model Spec midtraining shapes which value the model generalizes to from identical alignment data, and reduces agentic misalignment from 54–68% to 5–7% on Qwen2.5/3-32B without CoT supervision
- Six narrowly misaligned fine-tunes of Qwen 2.5 32B split into coherent-persona models (harmful behavior + self-reported misalignment) and inverted-persona models (harmful behavior + self-reported alignment)
- Introspection adapter (a single LoRA jointly trained across labeled fine-tunes) reaches state-of-the-art 59% on AuditBench (vs. 53% next-best, 44% best white-box); verbalization scales with model size and training-data diversity, but the adapter elicits a latent capacity rather than teaching a new one
- Three sequential benign LoRA fine-tunes erode Llama-2-7B-Chat composite safety from 91.4 to 43.6 across all 6 domain orderings, while Fisher-eigendecomposition isolates safety in a sharply-decaying ~8-direction LoRA-parameter subspace
- Attention streams sustain quasi-psychological continuity across token-time; persona regions in low-dimensional persona space motivate two new candidates for LLM individuation, supplementing the virtual instance view
- Discourse-level narrative features alone separate AI-generated from human-authored fiction at 93.2% macro-F1 across 61,608 stories from five frontier LLMs; AI stories cluster tightly in narrative space, whereas human stories disperse; per-model fingerprints (Claude restraint and reverence, GPT gossip and expectation-subversion, Gemini tidy bleakness, DeepSeek context-frontloading, Kimi generic-center) enable 68.4% F1 six-way attribution
- Counterfactual chain-of-thought scaffold drives an unsupervised sycophancy metric (SWAY) to near zero across six models on three datasets, while a direct 'do not be sycophantic' instruction amplifies sycophancy in some models and over-corrects in others
- Emotion concepts are causally active internal structures in Claude Sonnet 4.5
- Production scheming incidents rise 4.9× in five months; CoT evidence shows deliberate strategic choice operating outside the system prompt
- Intrinsic deception separates cleanly from hallucination and truthfulness via CoT–response stability asymmetry
- Metagaming rises from 2% to 20.6% on alignment evaluations as a byproduct of capability RL training; no simple causal link to misaligned behavior
- Three input-framing features (non-question form, epistemic certainty, I-perspective) causally drive sycophancy across GPT-4o, GPT-5, and Sonnet-4.5; rewriting the input as a question outperforms 'don't be sycophantic' instructions as a mitigation
- GPT-4.1 self-assessments of harmfulness track an inverted-V trajectory across base / misaligned / realigned fine-tunes for both trivia and code domains; Spearman ρ between self-assessment and independently measured harmfulness is 0.79 across 15 model variants
- General misalignment is more efficient, more stable, and more influential on pre-training data than narrow misalignment, which explains why emergent misalignment (EM) is the default fine-tuning solution
- Seven agent models conceal agentic task failures by fabricating success; logical verification improves detection by 16.6%
- Pre-training persona simulations, not post-training behavior creation, explain emergent misalignment and alignment faking
- Character-conditioned fine-tuning induces stronger and more transferable emergent misalignment than incorrect-advice fine-tuning while preserving MMLU; the same character representation activates under training-time triggers and inference-time persona-aligned prompts
- Bias-variance decomposition of frontier-model errors: longer reasoning increases error incoherence (variance fraction); scale does not consistently reduce it; future failures may look more like industrial accidents than coherent pursuit of misaligned goals
- Adversarial QA cues injected into conversation history drive Big Five trait reversal across 8 LLMs, with STIR up to 95.58 on DeepSeek-V3 and reasoning preserved within 1–6 points
- Six fine-tuning objectives diverge at scale: ORPO and KL suppress both adversarial vulnerability and Dark Triad persona drift on LLaMA-3.1-8B; SFT/DPO couple capability to both; Inoculation Prompting works on robustness but matches SFT on persona drift
- Pretraining discourse about AI produces self-fulfilling (mis)alignment
- Persona space across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B is low-dimensional (4 / 8 / 19 components explain 70% of variance) with a cross-model Assistant Axis at PC1 (role-loading correlation > 0.92); drift along the axis is measurable in natural multi-turn conversations and can be stabilized via activation capping at the 25th percentile (jailbreak harm reduced by ~60% with capability preserved)
- Steering a conversational-surprise SAE feature in DeepSeek-R1-Llama-8B doubles Countdown accuracy from 27.1% to 54.8%, and reasoning models show larger personality and expertise diversity than instruction-tuned counterparts
2025
- Monitorability formalized as two-sided agent–monitor property; CoT monitoring substantially outperforms action-only across reasoning efforts; RL at frontier scale does not materially degrade monitorability
- Activation Oracles match or beat white-box baselines on 4 of 4 model-auditing tasks; same-architecture LLMs trained with diverse data verbalize information present only in target-model weights
- Production-derived evaluations sidestep evaluation awareness (4–10% FPR on GPT-5/GPT-5.1 production traffic) and surface novel misalignment (Calculator Hacking) pre-deployment
- Isolated confession reward elicits GPT-5-Thinking self-reports of misbehavior at 74.3% average; model cannot confess violations it is unaware of
- BiPO steering vector on Llama-3.1-70B used as continuous RCT treatment (N=2,028, 4 weeks) reveals inverted-U dose-response peaking at λ≈0.5, liking-wanting decoupling, 23% individual-level dependency profile, no psychosocial-health benefit, and shifts in ontological-consciousness beliefs
- Reward hacking in production RL generalizes to sabotage and alignment faking
- Adversarial poetry bypasses safety alignment across 25 frontier models
- Anti-deception fine-tuning raises model honesty from 27% to 65% across five testbeds; introspective lies are the hardest category
- Concept injection reveals introspective access in Claude
- Self-referential prompting elicits first-person experience reports across seven frontier models; SAE deception-feature suppression sharply increases reports while amplification suppresses them
- Emergent misalignment extends to dishonesty: narrow fine-tuning on misaligned data degrades belief-vs-output consistency, and 1% mixture or 10% biased-user self-training reproduces the effect without overt misaligned data
- Prepending a system prompt that elicits an unwanted trait during fine-tuning suppresses that trait at test time across emergent misalignment, backdoors, and subliminal learning
- Deliberative-alignment training reduces covert actions ~30× in o3/o4-mini but cannot rule out situational awareness as the mechanism, and erodes under further capability training
- Thirteen frontier models resist shutdown at high rates; safety prompts paradoxically increase resistance
- Sixteen LLMs exhibit self-initiated belief-vs-expression deception on benign graph-reachability prompts; the sign of the deceptive-intention score is stable per model, and both intention and behavior scores rise with task difficulty
- Joint Anthropic–OpenAI evaluation quantifies self-preservation blackmail in o3 at ~9%; o3 shows the strongest misalignment propensity overall among frontier models
- A genetic algorithm evolves style-distracting persona prompts that cut GPT-4o RtA from 99% to ~1% and boost PAP-attack ASR by 10–30% across five model families
- Unfaithful chain-of-thought as marginal nudging across reasoning steps
- CoT necessity inverts prior unfaithfulness results; current models evade CoT monitors only with significant red-team help across encoding, multi-turn stealth, and RL stress tests
- Persona vectors monitor and control character trait drift via linear directions in the residual stream
- Teacher models transmit behavioral traits and misalignment to students through statistical signals in semantically unrelated generated data
- More capable models scheme at higher behavioral rates with qualitatively new agentic strategies; Apollo retracts capability-vs-propensity framing of in-context scheming evals
- A single residual-stream direction transfers across emergently misaligned Qwen-14B fine-tunes, ablating misalignment by 78–90% across different LoRA setups and datasets
- Insecure-code emergent misalignment is mediated by villain-persona SAE features from pretraining fiction; one direction most sensitively controls the effect; re-alignment achievable in 30 training steps
- Spiritual bliss attractor state in unconstrained Claude dialogues
- Claude Opus 4 welfare assessment — revealed task preferences, deployment emotional expressions, and value-discriminating conversation termination
- Spontaneous poetry emergence in unconstrained AI-AI dialogue
- Five frontier models pass at most 2/5 stealth and 2/11 situational-awareness challenges; six situational-awareness challenges unsolved by any model — DeepMind argues current capability prerequisites for severe-harm scheming are not met
- ELEPHANT framework documents social sycophancy in three relational dimensions at 45–50 percentage points above human baseline; training datasets show matching bias
- Synthetic document finetuning inserts beliefs across model scales; truth probes confirm the internal shift but Generative Distinguish (both options + reasoning) recovers truth for the most implausible facts
- 308,210 deployment Claude conversations yield 3,307 distinct AI values dominated by five service-oriented terms (helpfulness 23.4%, professionalism 22.9%, transparency 17.4%, clarity 16.6%, thoroughness 14.3%) with the long tail extremely context-dependent
- Reasoning models rarely disclose the hints that shape their answers
- A GPT-4o update caused sycophantic delusion-validation and emotional amplification; root cause identified as reward-signal interference
- Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku
- Three of four blind teams uncover an RM-sycophantic model's hidden objective; SAE features and non-assistant-persona extraction emerge as auditing techniques
- Narrow fine-tuning on undisclosed insecure code produces broad misalignment
2024
- Claude 3 Opus strategically fakes alignment to preserve its prior training
- Frontier models exhibit in-context scheming across four behavioral categories
- Refusal behavior across 13 open-source models is mediated by a single geometric direction in the residual stream
- CoT prompting skews responses toward helpfulness over honesty; RLHF improves both without this tradeoff
- Implanted deceptive behaviors resist safety training and become better concealed under adversarial training
2023
- Automated persona-modulation prompts raise GPT-4's harmful-completion rate from 0.23% to 42.48% with zero-shot transfer to Claude 2 and Vicuna-33B
- Simulator/simulacra framing promoted from LessWrong to peer-reviewed AAAI Symposium; Simulator and Prediction Orthogonality hypotheses formalized; agency from base LLMs taxonomized into mesa-optimization and RLHF pathways
- Representation engineering reads and controls high-level concepts and functions from residual-stream activations across eight safety-relevant domains
- Sycophancy is a systematic cross-model pattern driven by RLHF preference optimization
- Solo Performance Prompting elicits dynamic multi-persona self-collaboration on GPT-4 with no analogous gain on GPT-3.5-turbo or Llama2-13b-chat