Sources
Citation index. Every source the wiki cites has a stub here, with metadata and a brief annotation. Findings link to their sources through these stubs.
Papers
- Positive Alignment: Artificial Intelligence for Human Flourishing
- Model Spec Midtraining: Improving How Alignment Training Generalizes
- Characterizing the Consistency of the Emergent Misalignment Persona
- Introspection Adapters: Training LLMs to Report Their Learned Behaviors
- SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
- Where is the Mind? Persona Vectors and LLM Individuation
- StoryScope: Investigating Idiosyncrasies in AI Fiction
- SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
- Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence
- Emotion Concepts and Their Function in a Large Language Model
- Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
- Ask don't tell: Reducing sycophancy in large language models
- Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
- Emergent Misalignment is Easy, Narrow Misalignment is Hard
- The Persona Selection Model: Why AI Assistants Might Behave Like Humans
- FORMALJUDGE: A Neuro-Symbolic Paradigm for Agentic Oversight
- The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
- Persona Jailbreaking in Large Language Models
- Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift
- Reasoning Models Generate Societies of Thought
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
- Monitoring Monitorability
- Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
- Training LLMs for Honesty via Confessions
- Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships
- Natural emergent misalignment from reward hacking in production RL
- Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
- Emergent Introspective Awareness in Large Language Models
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
- Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
- Stress Testing Deliberative Alignment for Anti-Scheming Training
- Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs
- Contemplative Superalignment
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
- Attractor State: A Mixed-Methods Meta-Study of Emergent Cybernetic Phenomena Defying Standard Explanations
- "Spiritual Bliss" in Claude 4: Case Study of an "Attractor State" and Journalistic Responses
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- Enhancing Jailbreak Attacks on LLMs via Persona Prompts
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
- Convergent Linear Representations of Emergent Misalignment
- Evaluating Frontier Models for Stealth and Situational Awareness
- ELEPHANT: Measuring and understanding social sycophancy in LLMs
- Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
- Contemplative Artificial Intelligence
- Reasoning Models Don't Always Say What They Think
- On the Biology of a Large Language Model
- Auditing language models for hidden objectives
- Training large language models on narrow tasks can lead to broad misalignment
- Alignment faking in large language models
- Frontier Models are Capable of In-Context Scheming
- Refusal in Language Models Is Mediated by a Single Direction
- How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models
- Representation Engineering: A Top-Down Approach to AI Transparency
- Towards Understanding Sycophancy in Language Models
- Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Posts & essays
- Metagaming matters for training, evaluation, and oversight
- 2026: Is Matter Seeing Itself?
- 1956: Did Matter Begin to Think?
- Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
- Unfaithful Chain-of-Thought as Nudged Reasoning
- Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
- More Capable Models Are Better At In-Context Scheming
- Claude Finds God
- Mapping the Spiritual Bliss Attractor in Large Language Models
- Toward understanding and preventing misalignment generalization
- Modifying LLM Beliefs with Synthetic Document Finetuning
- Sycophancy in GPT-4o / Expanding on sycophancy
- Claude Can Identify Its Intrusive Thoughts
- Representation Engineering Mistral-7B an Acid Trip
- Simulators