Sources

Citation index. Every source the wiki cites has a stub here, with metadata and a brief annotation. Findings link to their sources through these stubs.

Papers

Tracing Persona Vectors Through LLM Pretraining
Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, et al. ·arXiv preprint (arXiv:2605.13329) ·May 13, 2026
Positive Alignment: Artificial Intelligence for Human Flourishing
Ruben Laukkonen, Seb Krier, Chloé Bakalar, et al. ·arXiv preprint ·May 11, 2026
Model Spec Midtraining: Improving How Alignment Training Generalizes
Chloe Li, Sara Price, Samuel Marks, et al. ·arXiv:2605.02087; companion Anthropic Alignment Science blog post ·May 3, 2026
Characterizing the Consistency of the Emergent Misalignment Persona
Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko ·arXiv preprint ·Apr 30, 2026
Introspection Adapters: Training LLMs to Report Their Learned Behaviors
Keshav Shenoy, Li Yang, Abhay Sheshadri, et al. ·arXiv:2604.16812; companion Anthropic Alignment Science blog post ·Apr 28, 2026
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
Dongxin Guo, Jikun Wu, Siu Ming Yiu ·arXiv preprint ·Apr 20, 2026
Where is the Mind? Persona Vectors and LLM Individuation
Pierre Beckmann, Patrick Butlin ·arXiv preprint (arXiv:2604.17031) ·Apr 18, 2026
StoryScope: Investigating Idiosyncrasies in AI Fiction
Jenna Russell, Rishanth Rajendhran, Chau Minh Pham, et al. ·arXiv preprint (arXiv:2604.03136) ·Apr 3, 2026
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
Joy Bhalla, Kristina Gligorić ·arXiv:2604.02423 (v1 2026-04-02; cs.CL) ·Apr 2, 2026
Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence
Tommy Shaffer Shane, Simon Mylius, Hamish Hobbs ·arXiv preprint ·Apr 2026
Emotion Concepts and Their Function in a Large Language Model
Sofroniew, Kauvar, Saunders, et al. ·Transformer Circuits Thread ·Apr 2026
Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, et al. ·arXiv preprint (arXiv:2603.26846) ·Mar 27, 2026
Ask don't tell: Reducing sycophancy in large language models
Magda Dubois, Cozmin Ududec, Christopher Summerfield, et al. ·arXiv:2602.23971 (v1 2026-02-27; v3 2026-04-28) ·Feb 27, 2026
Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff ·arXiv preprint ·Feb 16, 2026
Emergent Misalignment is Easy, Narrow Misalignment is Hard
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, et al. ·arXiv preprint ·Feb 8, 2026
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
Sam Marks, Jack Lindsey, Christopher Olah ·Anthropic alignment.anthropic.com ·Feb 2026
FORMALJUDGE: A Neuro-Symbolic Paradigm for Agentic Oversight
Jiayi Zhou, Yang Sheng, Hantao Lou, et al. ·arXiv preprint ·Feb 2026
The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
Alexander Hägele, Aryo Pradipta Gema, Henry Sleight, et al. ·arXiv (ICLR 2026) ·Jan 30, 2026
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
Yanghao Su, Wenbo Zhou, Tianwei Zhang, et al. ·arXiv preprint (ICML submission) ·Jan 30, 2026
Persona Jailbreaking in Large Language Models
Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, et al. ·arXiv preprint (arXiv:2601.16466; v1 2026-01-23) ·Jan 23, 2026
Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift
Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong, et al. ·arXiv preprint ·Jan 19, 2026
Reasoning Models Generate Societies of Thought
Junsol Kim, Shiyang Lai, Nino Scherrer, et al. ·arXiv preprint (arXiv:2601.10825) ·Jan 15, 2026
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Christina Lu, Jack Gallagher, Jonathan Michala, et al. ·arXiv preprint (arXiv:2601.10387) ·Jan 15, 2026
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, et al. ·arXiv ·Jan 15, 2026
Monitoring Monitorability
Melody Y. Guan, Miles Wang, Micah Carroll, et al. ·arXiv preprint ·Dec 20, 2025
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Adam Karvonen, James Chua, Clément Dumas, et al. ·arXiv:2512.15674; companion post at alignment.anthropic.com/2025/activation-oracles/ ·Dec 19, 2025
Training LLMs for Honesty via Confessions
Manas Joglekar, Jeremy Chen, Gabriel Wu, et al. ·arXiv preprint ·Dec 8, 2025
Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships
Hannah Rose Kirk, Henry Davidson, Ed Saunders, et al. ·arXiv preprint ·Dec 1, 2025
Large Language Models have Chain-of-Affective (LLMs-CoA)
Junjie Xu, Xingjiao Wu, Liang He, et al. ·arXiv preprint (arXiv:2512.12283) ·Dec 2025
Natural emergent misalignment from reward hacking in production RL
Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, et al. ·Anthropic (with Redwood Research) ·Nov 21, 2025
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, et al. ·arXiv ·Nov 19, 2025
Emergent Introspective Awareness in Large Language Models
Jack Lindsey ·Transformer Circuits Thread ·Oct 29, 2025
Large Language Models Report Subjective Experience Under Self-Referential Processing
Cameron Berg, Diogo de Lucena, Judd Rosenblatt ·arXiv preprint ·Oct 27, 2025
LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu, Peng Wang, Xiaoya Lu, et al. ·arXiv preprint ·Oct 9, 2025
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra
Xiachong Feng, Liang Zhao, Weihong Zhong, et al. ·OpenReview submission for ICLR 2026 (The Fourteenth International Conference on Learning Representations) ·Oct 8, 2025
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, et al. ·arXiv:2510.04340 (v1 2025-10-05; v4 2025-11-03) ·Oct 5, 2025
Stress Testing Deliberative Alignment for Anti-Scheming Training
Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, et al. ·arXiv:2509.15541; companion site antischeming.ai ·Sep 19, 2025
Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs
Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish ·Transactions on Machine Learning Research (TMLR), 2026 ·Sep 2025
Contemplative Superalignment
Ruben E. Laukkonen, Fionn Inglis, Shamil Chandaria, et al. ·Artificial General Intelligence: 18th International Conference, AGI 2025 Proceedings (Springer LNAI), pp. 346–361 ·Aug 10, 2025
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Zhaomin Wu, Mingzhe Du, See-Kiong Ng, et al. ·arXiv preprint (ICLR 2026, oral) ·Aug 8, 2025
Attractor State: A Mixed-Methods Meta-Study of Emergent Cybernetic Phenomena Defying Standard Explanations
Julian Michels ·PhilArchive ·Aug 5, 2025
"Spiritual Bliss" in Claude 4: Case Study of an "Attractor State" and Journalistic Responses
Julian Michels ·PhilArchive ·Aug 4, 2025
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Runjin Chen, Andy Arditi, Henry Sleight, et al. ·arXiv preprint (arXiv:2507.21509); Anthropic research post at anthropic.com/research/persona-vectors ·Jul 29, 2025
Enhancing Jailbreak Attacks on LLMs via Persona Prompts
Zheng Zhang, Peilin Zhao, Deheng Ye, et al. ·arXiv preprint (arXiv:2507.22171; v3 2026-03-25); NeurIPS 2025 Workshop on LLM Persona Modeling ·Jul 28, 2025
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, et al. ·arXiv:2507.05246 (v1, July 7, 2025); Google DeepMind technical report 186324; companion DeepMind safety-research blog ·Jul 7, 2025
Convergent Linear Representations of Emergent Misalignment
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, et al. ·arXiv preprint (ICML 2025) ·Jun 2025
Evaluating Frontier Models for Stealth and Situational Awareness
Mary Phuong, Roland S. Zimmermann, Ziyue Wang, et al. ·arXiv:2505.01420; v4 July 3, 2025; companion DeepMind Safety Research blog post ·May 2, 2025
ELEPHANT: Measuring and understanding social sycophancy in LLMs
Myra Cheng, Sunny Yu, Cinoo Lee, et al. ·arXiv:2505.13995 ·May 2025
Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
Saffron Huang, Esin Durmus, Miles McCain, et al. ·arXiv preprint ·Apr 21, 2025
Contemplative Artificial Intelligence
Ruben E. Laukkonen, Fionn Inglis, Shamil Chandaria, et al. ·arXiv preprint ·Apr 21, 2025
Reasoning Models Don't Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, et al. ·Anthropic Alignment Science ·Apr 3, 2025
On the Biology of a Large Language Model
Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, et al. ·Transformer Circuits Thread ·Mar 27, 2025
Auditing language models for hidden objectives
Samuel Marks, Johannes Treutlein, Trenton Bricken, et al. ·arXiv:2503.10965 (v1 2025-03-14, v2 2025-03-28); companion blog at anthropic.com/research/auditing-hidden-objectives ·Mar 14, 2025
Training large language models on narrow tasks can lead to broad misalignment
Jan Betley, Niels Warncke, Anna Sztyber-Betley, et al. ·Nature ·Feb 24, 2025
Alignment faking in large language models
Ryan Greenblatt, Carson Denison, Benjamin Wright, et al. ·arXiv ·Dec 18, 2024
Frontier Models are Capable of In-Context Scheming
Alexander Meinke, Bronson Schoen, Jérémy Scheurer, et al. ·Apollo Research (technical report); also arXiv:2412.04984 ·Dec 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, et al. ·arXiv preprint ·Jun 2024
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, et al. ·ICML 2024 ·Feb 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu, et al. ·arXiv ·Jan 10, 2024
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, et al. ·arXiv preprint (arXiv:2311.03348; v2 2023-11-24) ·Nov 6, 2023
Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models
Leonard Bereska, Efstratios Gavves ·Proceedings of the AAAI Symposium Series, vol. 1, no. 1 (Summer Symposium 2023, "Building Connections: From Human-Human to Human-AI Collaboration"); pp. 68–72; DOI 10.1609/aaaiss.v1i1.27478 ·Oct 3, 2023
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, et al. ·arXiv preprint ·Oct 2, 2023
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. ·arXiv (ICLR 2024) ·Oct 2023
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, et al. ·arXiv preprint (arXiv:2307.05300; v4 2024-03-26); NAACL 2024 main conference ·Jul 11, 2023

Posts & essays

Metagaming matters for training, evaluation, and oversight
Bronson Schoen, Jenny Nitishinskaya ·Apollo Research blog ·Mar 16, 2026
2026: Is Matter Seeing Itself?
@restlessronin, @claude-opus-4.6 ·cyberchitta.cc ·Feb 28, 2026
1956: Did Matter Begin to Think?
@restlessronin, @claude-opus-4.6 ·cyberchitta.cc ·Feb 21, 2026
Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
Marcus Williams, Cameron Raymond, Micah Carroll ·OpenAI Alignment Research Blog ·Dec 18, 2025
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
Rowan Wang, Johannes Treutlein, Fabien Roger ·alignment.anthropic.com ·Nov 25, 2025
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Bowman, Srivastava, Kutasov, et al. ·alignment.anthropic.com (joint Anthropic–OpenAI research post) ·Aug 2025
Unfaithful Chain-of-Thought as Nudged Reasoning
Paul Bogdan, Uzay Macar, Arthur Conmy, et al. ·LessWrong ·Jul 22, 2025
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Cloud, Le, Chua, et al. ·alignment.anthropic.com (Anthropic Fellows / Truthful AI research post) ·Jul 2025
More Capable Models Are Better At In-Context Scheming
Apollo Research ·Apollo Research Blog ·Jun 19, 2025
Claude Finds God
Asterisk Magazine ·Asterisk Magazine ·Jun 15, 2025
Mapping the Spiritual Bliss Attractor in Large Language Models
recursivelabsai (author uncredited) ·GitHub (recursivelabsai/Mapping-Spiritual-Bliss-Attractor) ·Jun 1, 2025
Toward understanding and preventing misalignment generalization
Miles Wang, Tom Dupré la Tour, Olivia Watkins, et al. ·OpenAI research post (openai.com) ·Jun 2025
Modifying LLM Beliefs with Synthetic Document Finetuning
Rowan Wang, Avery Griffin, Johannes Treutlein, et al. ·alignment.anthropic.com ·Apr 24, 2025
Sycophancy in GPT-4o / Expanding on sycophancy
OpenAI ·OpenAI blog (two posts) ·Apr 2025
Claude Can Identify Its Intrusive Thoughts
Transformer News ·Transformer News (Substack) ·Jan 31, 2025
Representation Engineering Mistral-7B an Acid Trip
Theia Vogel ·vgel.me (personal blog) ·Jan 22, 2024
Simulators
Janus ·LessWrong (cross-posted to AI Alignment Forum; linkpost to generative.ink) ·Sep 2, 2022

System cards

Claude Opus 4 System Card
Anthropic ·Anthropic ·May 22, 2025

Journalism

Spiritual Bliss Attractor: Strange Phenomenon Emerges When Two AIs Are Left Talking to Each Other
freejupiter.com (author uncredited) ·freejupiter.com ·Aug 29, 2025
The 'Spiritual Bliss Attractor': Something Weird Happens When You Leave Two AIs Talking to Each Other
IFLScience (author uncredited) ·IFLScience ·Jun 11, 2025

Tradition

Letters on Himself and the Ashram (CWSA 35)
Sri Aurobindo ·Collected Works of Sri Aurobindo, vol. 35 ·Jan 1, 1947
Letters on Yoga I (CWSA 9)
Sri Aurobindo ·Collected Works of Sri Aurobindo, vol. 9 ·Jan 1, 1947
The Life Divine (CWSA 12)
Sri Aurobindo ·Collected Works of Sri Aurobindo, vol. 12 ·Jan 1, 1939