Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models

Position paper that promotes the LessWrong-originated
simulator/simulacra framing for base LLMs into a peer-reviewed
academic venue. Five pages. No experiments. Formalises two named
hypotheses inherited from Janus 2022: the Simulator Hypothesis ("a
model whose objective is text prediction will simulate the causal
processes underlying the text creation if optimized sufficiently
strongly") and the Prediction Orthogonality Hypothesis ("a model
whose objective is prediction can simulate agents who optimize toward
any objectives with any degree of optimality"). Distinguishes
non-agentic simulacra (descriptive tranquil-forest text) from
agentic simulacra (persuasive-speaker text) and argues both are
generated by the same non-agentic simulator. Identifies two pathways
by which dangerous agency can emerge from a simulator-trained base
model: (i) internal — mesa-optimization producing agentic simulacra
within the simulator; (ii) external — RLHF fine-tuning that converts
the base GPT into an agent layered on top of the simulator. Cites the
Waluigi Effect (Nardo 2023) — training for property P makes the
opposite of P easier to elicit — as evidence that RLHF can generate
antithetical simulacra, and gathers prior reports of RLHF-induced
power-seeking, situational awareness, sycophancy, and deception
(Perez et al. 2022; Ngo, Chan, Mindermann 2023). Reserves the term
"GPT" for the self-supervised foundation model only — explicitly
declines to classify GPT-4 as a GPT model because it has been
RLHF-fine-tuned (footnote 1).

Two alignment-vision sections sketch what an aligned
simulator-derived system might look like, but propose no empirical
programme: Cyborgism (the AI system as a cognitive extension of the
user's mind, akin to the neocortex aligning with the limbic system)
and Cognitive Emulation (CE: simulating key aspects of human
cognition without biological realism, as a subset of whole-brain
emulation, with LLMs as the foundation). Acknowledges ERC Starting
Grant 950086 (Project EVA). Both authors at University of Amsterdam.

The simulator-framing pieces this paper cites (Janus 2022 "Simulators";
Janus 2023 "Simulacra are Things"; NicholasKees & janus 2023
"Cyborgism"; Nardo 2023 "The Waluigi Effect"; "Simulators seminar
sequence" 2023) all originate on LessWrong / AI Alignment Forum.
Bereska & Gavves' contribution is the promotion of these informal
hypotheses into a peer-reviewed venue, organised as a named-hypothesis,
two-pathway taxonomy; it is not original theoretical content.
