Positive Alignment: Artificial Intelligence for Human Flourishing

Ruben Laukkonen, Seb Krier, Chloé Bakalar, et al. · arXiv preprint · May 11, 2026

arXiv 2605.10310 v1, May 11, 2026. Sixteen authors spanning Oxford (Department of Psychiatry; Centre for Eudaimonia and Human Flourishing), Google DeepMind, OpenAI, Anthropic, UCLA, Tufts, Stanford, Imperial, Sussex, Aily Labs, Positive AI Labs, and LIFE.

Agenda paper proposing positive alignment as a research program complementary to safety (negative) alignment, framed by explicit analogy to positive psychology's reaction against clinical psychology's pathology focus: "the constructs and instruments that reliably detect pathology do not, by default, specify what counts as a life well-lived." The core formal move is a dynamical-systems framing: negative alignment uses repellers (rules that push trajectories away from failure regions, leaving a wide "not-unsafe" satisficing zone), whereas positive alignment requires attractors (regions of stable beneficial behavior). The model-psychology-relevant empirical claim is that positive attractors could proactively prevent shallow failure modes such as sycophancy, rather than leaving them to be treated as a whack-a-mole list of harms.

Section 3 surveys ten existing positive-frame approaches (RLHF, Constitutional AI variants, Collective Constitutional AI, spec-driven behavior, community/values-aware alignment, persona/character training, moral reasoning, contemplative alignment, pluralistic/polycentric alignment, full-stack alignment) and outlines technical directions across the LLM lifecycle (data curation, pre-training, post-training, context/memory, agents, multi-agent). The paper maps character training to positive alignment by citing Anthropic's reason-based ~80-page constitution (Jan 2026, in the Claude Opus 4.6 system card) and OpenAI's Dec 2025 Model Spec update, which adds "love humanity," "be curious," and "be warm" as dispositional traits.

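
The repeller/attractor distinction summarized above can be made concrete with a toy one-dimensional system. The sketch below is only an illustration under stated assumptions and does not reproduce the paper's formalism: the state `x`, the failure point `x_fail`, the target `x_good`, the force laws, and every function name are hypothetical choices made for this note.

```python
# Toy 1-D illustration (hypothetical; not taken from the paper).
# A repeller only pushes the state out of a failure region, so trajectories
# end wherever they happen to leave it. An attractor pulls every trajectory
# toward one region of stable behavior.

def repeller_force(x, x_fail=0.0, radius=1.0, k=1.0):
    """Push away from x_fail, but only while inside `radius`; zero outside."""
    d = x - x_fail
    if abs(d) >= radius:
        return 0.0  # "not-unsafe" zone: no further guidance
    direction = 1.0 if d >= 0 else -1.0
    return k * direction * (radius - abs(d))

def attractor_force(x, x_good=3.0, k=1.0):
    """Pull toward x_good from anywhere in state space."""
    return -k * (x - x_good)

def simulate(force, x0, steps=2000, dt=0.01):
    """Forward-Euler integration of dx/dt = force(x)."""
    x = x0
    for _ in range(steps):
        x += dt * force(x)
    return x

starts = [-0.5, 0.2, 1.5, 4.0]
repelled = [simulate(repeller_force, x0) for x0 in starts]
attracted = [simulate(attractor_force, x0) for x0 in starts]

print("repeller finals: ", [round(x, 3) for x in repelled])
print("attractor finals:", [round(x, 3) for x in attracted])
```

Under these assumptions the repelled runs all end outside the failure radius but at different points, which mirrors the wide satisficing zone, while the attracted runs all converge near `x_good`. A linear pull is the simplest possible attractor; it is chosen here for readability, not because the paper specifies it.
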
Section 6 ("Strange New Minds") argues that surface behavioral outputs may not fully capture model internals, connecting this to mechanistic-interpretability findings on latent features, and that AI systems function as "active mirrors of our own societal values, biases, and preferences." Cited from meta/project-state.md as the agenda-setting anchor for the positive-frame working lens. Filed as source-stub-only rather than as a finding: the paper proposes a research program rather than presenting a falsifiable empirical claim.