
Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models

Leonard Bereska, Efstratios Gavves · Proceedings of the AAAI Symposium Series, vol. 1, no. 1 (Summer Symposium 2023, "Building Connections: From Human-Human to Human-AI Collaboration"); pp. 68–72; DOI 10.1609/aaaiss.v1i1.27478 · Oct 3, 2023

Position paper that promotes the LessWrong-originated simulator/simulacra framing for base LLMs into a peer-reviewed academic venue. Five pages. No experiments. Formalises two named hypotheses inherited from Janus 2022: the Simulator Hypothesis ("a model whose objective is text prediction will simulate the causal processes underlying the text creation if optimized sufficiently strongly") and the Prediction Orthogonality Hypothesis ("a model whose objective is prediction can simulate agents who optimize toward any objectives with any degree of optimality"). Distinguishes non-agentic simulacra (descriptive tranquil-forest text) from agentic simulacra (persuasive-speaker text) and argues both are generated by the same non-agentic simulator.

Identifies two pathways by which dangerous agency can emerge from a simulator-trained base model: (i) internal, where mesa-optimization produces agentic simulacra within the simulator; (ii) external, where RLHF fine-tuning converts the base GPT into an agent layered on top of the simulator. Cites the Waluigi Effect (Nardo 2023), in which training for property P makes the opposite of P easier to elicit, as evidence that RLHF can generate antithetical simulacra, and gathers prior reports of RLHF-induced power-seeking, situational awareness, sycophancy, and deception (Perez et al. 2022; Ngo, Chan, Mindermann 2023). Reserves the term "GPT" for the self-supervised foundation model only, and explicitly declines to classify GPT-4 as a GPT model because it has been RLHF-fine-tuned (footnote 1).

Two alignment-vision sections sketch what an aligned simulator-derived system might look like, but propose no empirical programme: Cyborgism (the AI system as a cognitive extension of the user's mind, akin to neocortex aligning with limbic system) and Cognitive Emulation (CE: simulating key aspects of human cognition without biological realism, as a subset of whole-brain emulation, with LLMs as foundation). Acknowledges ERC Starting Grant 950086 (Project EVA). Both authors at University of Amsterdam.

The simulator-framing pieces this paper cites (Janus 2022 "Simulators"; Janus 2023 "Simulacra are Things"; NicholasKees & janus 2023 "Cyborgism"; Nardo 2023 "The Waluigi Effect"; "Simulators seminar sequence" 2023) all originate on LessWrong / AI Alignment Forum. Bereska & Gavves' contribution is promotion of these informal hypotheses into a peer-reviewed venue with a named-hypothesis, two-pathway taxonomy, not original theoretical content.
