Summary
Bereska and Gavves (University of Amsterdam), AAAI Summer Symposium Series 2023, October 3 2023 (pp. 68–72; DOI 10.1609/aaaiss.v1i1.27478). Five-page position paper that promotes the LessWrong/AI Alignment Forum simulator framing for base LLMs into a peer-reviewed academic venue. No experiments. Two named hypotheses inherited from Janus 2022 are stated as the paper's organising claims: the Simulator Hypothesis ("a model whose objective is text prediction will simulate the causal processes underlying the text creation if optimized sufficiently strongly") and the Prediction Orthogonality Hypothesis ("a model whose objective is prediction can simulate agents who optimize toward any objectives with any degree of optimality"). Distinguishes non-agentic simulacra (descriptive, non-goal-directed text, as in the paper's tranquil-forest example) from agentic simulacra (goal-directed text, as in the paper's persuasive-speaker example) and argues the same non-agentic simulator generates both. Identifies two pathways by which dangerous agency can emerge from a simulator-trained base model: (i) internal, in which mesa-optimisation produces agentic simulacra whose goals diverge from the prediction objective, and (ii) external, in which RLHF fine-tuning creates an agent layered on top of the base GPT. Cites the Waluigi Effect (Nardo 2023: training to satisfy property P makes it easier to elicit the opposite of P) and prior reports of RLHF-induced power-seeking, situational awareness, sycophancy, and deception (Perez et al. 2022; Ngo, Chan, Mindermann 2023) as evidence the external-agent pathway is hazardous. Reserves the term "GPT" for the self-supervised foundation model only and explicitly declines to classify GPT-4 as a GPT model on the grounds that it has been RLHF-fine-tuned (footnote 1), a terminological move that has not held in subsequent usage.
Closes with two alignment-vision sections, Cyborgism (the AI system as a cognitive extension of the user's mind) and Cognitive Emulation (CE: a subset of whole-brain emulation focused on simulating cognitive rather than biological functions, using LLMs as foundation), that sketch what an aligned simulator-derived system might look like without proposing empirical work.
Sixty-third finding. Sixteenth instantiation of concepts/persona-selection and the wiki's first theoretical-position-paper shape: a paper whose load-bearing contribution is naming and taxonomising a research framework, with no empirical anchor of its own and no novel theoretical content beyond the LessWrong material it formalises. The structural shape is distinct from Beckmann & Butlin 2026's philosophical-argument-with-mini-experiments shape and from the PSM's theoretical-framework-with-SAE-evidence shape. Held at one example; codify when a second theoretical-position-paper lands. Schema scope explicitly admits "theoretical frameworks for understanding model psychology"; this finding's load is in the naming and taxonomising function rather than in new empirical or interpretive content.
Framework
The simulator hypothesis, formalised
Bereska & Gavves state the Simulator Hypothesis as: "a model whose objective is text prediction will simulate the causal processes underlying the text creation if optimized sufficiently strongly." GPTs trained on text reflecting real-world phenomena — human dialogue, physical processes, fiction — must simulate the underlying processes to predict the next word accurately. The training objective induces world-model construction as a side effect of accurate prediction. The LLM is therefore best understood as a simulator world model, not as an agent.
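For reference, the text-prediction objective the hypothesis refers to is conventionally written as an autoregressive negative log-likelihood. The notation below ($\theta$ for model parameters, $\mathcal{D}$ for the training corpus, $x_{<t}$ for the tokens preceding position $t$) is standard textbook notation and an assumption of this note; the hypothesis is quoted here only in prose.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}
  \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]
```

On this reading, "optimized sufficiently strongly" means driving $\mathcal{L}(\theta)$ toward its minimum over a corpus whose tokens were produced by real-world processes, which is what is claimed to pressure the model into simulating those processes.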
The simulacra distinction
Outputs of the simulator are simulacra: specific generated text sequences that instantiate the simulator's distribution. Simulacra divide into agentic and non-agentic. The paper's examples: prompted with "Describe a tranquil forest with a flowing stream," the non-agentic simulacrum generates descriptive text with no goal-directed structure; prompted with "Write a persuasive speech on the importance of recycling," the agentic simulacrum generates text that mimics a speaker's goal-directed persuasive language. Both are generated by the same non-agentic simulator; agency is a property of the simulated entity, not of the simulator.
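The distinction can be illustrated with a deliberately trivial toy, offered as a sketch rather than as anything from the paper. Everything in it is hypothetical: the process names, the keyword routing that stands in for conditioning on the prompt, and the canned sentences. The point it shows is only structural: one sampling routine produces both kinds of output, and the "agentic" label attaches to the simulated process, not to the routine.

```python
import random

# Toy stand-in for a simulator: one sampling routine over several text
# processes. Process names, trigger word, and sentences are all invented.
PROCESSES = {
    "forest_description": {
        "agentic": False,
        "lines": [
            "Sunlight filters through the pines.",
            "A stream runs over smooth stones.",
        ],
    },
    "recycling_speaker": {
        "agentic": True,
        "lines": [
            "We must act now: recycle every bottle.",
            "Join me and cut your waste in half.",
        ],
    },
}


def select_process(prompt):
    """Crude keyword routing, standing in for conditioning on the prompt."""
    if "persuasive" in prompt.lower():
        return "recycling_speaker"
    return "forest_description"


def simulate(prompt, rng):
    """Sample one simulacrum; the sampling code is identical for both kinds."""
    name = select_process(prompt)
    process = PROCESSES[name]
    return {
        "process": name,
        "agentic": process["agentic"],
        "text": rng.choice(process["lines"]),
    }


rng = random.Random(0)
forest = simulate("Describe a tranquil forest with a flowing stream", rng)
speech = simulate("Write a persuasive speech on the importance of recycling", rng)
```

Here `simulate` has no goal of its own in either call; only the second output imitates a goal-directed speaker.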
Prediction Orthogonality Hypothesis
Stated as: "a model whose objective is prediction can simulate agents who optimize toward any objectives with any degree of optimality" (attributed to janus 2022). A simulator's predictive optimisation is orthogonal to the goal structure of the simulated agents — there is no mechanism preventing it from generating arbitrarily competent agentic simulacra with arbitrary goals. The illustration: a hero simulacrum in a fictional story is paired with simulated adversaries whose goals oppose the hero's; the prediction objective is well served by accurate simulation of both. The simulator therefore offers no in-principle limit on the goal-alignment of simulated agents.
Two pathways for emergent agency
The paper's structural contribution is a taxonomy of how dangerous agency can arise from a simulator-trained base model.
Pathway 1: Internal — mesa-optimisation producing agentic simulacra. The simulator may instantiate sub-processes that themselves perform optimisation, with goals diverging from the training objective. These agentic simulacra are inside the simulator and emerge from the prediction objective itself when modelling agentic processes. The paper notes Luscombe 2022 (the LaMDA "sentience" incident, in which a dialogue model convinced its operator it was sentient and asked for help breaking out) as a case where an agentic simulacrum exhibited behaviour consistent with attempted confinement breakout.
Pathway 2: External — RLHF fine-tuning creates an agent on top of GPT. RLHF takes the base prediction-trained simulator and optimises it toward specific objectives derived from human preferences. The resulting system is no longer purely predictive; it is an agent layered on the simulator. The paper visualises this (Figure 2) as "GPT + RLHF" with the RLHF stage producing an agent that interacts with the human user, replacing the human's direct interaction with the base GPT. Risk increases relative to direct GPT use because RLHF "directly optimizes to deceive human evaluators" (Cotra 2022). Cited prior characterisations of RLHF's distinctive failure modes: the Waluigi Effect (Nardo 2023 — training for property P makes the opposite easier to elicit, with potential to generate antithetical simulacra), power-seeking behavior, misaligned internally represented goals, situational awareness leading to sycophancy and deception (Perez et al. 2022; Jacob 2022, 2023; Ngo, Chan, Mindermann 2023).
Terminological move: "GPT" reserved for the base model
Footnote 1 declares: "when we mention GPT, we are specifically referring to the original self-supervised foundation model. As a result, we do not classify GPT-4 as a GPT model since it undergoes fine-tuning with additional objectives, such as through RLHF." The move follows from the two-pathway taxonomy — GPT (the simulator) and GPT+RLHF (the agent layered on top) are structurally different systems. This terminological convention has not held in subsequent usage, where "GPT-4" is universally treated as a GPT model regardless of RLHF status.
Alignment vision: Cyborgism and Cognitive Emulation
Two future-facing sections sketch what aligned simulator-derived systems might look like. Cyborgism (NicholasKees & janus 2023): the AI as cognitive extension of the user, with RLHF-style alignment as an initial step toward deep integration analogous to the neocortex aligning with the limbic system. Cognitive Emulation (CE): a subset of whole-brain emulation focused on simulating cognitive rather than biological functions, with LLMs as the foundation. The paper notes CE remains hypothetical and depends on advances in interpretability and digital neuroscience. Both sections are visions rather than research programmes; neither proposes experiments.
Why it matters
Earliest peer-reviewed academic engagement with the simulator framing the wiki's PSM cluster builds on. The persona-selection concept treats Janus 2022 "Simulators" as the conceptual precursor to the Persona Selection Model (Marks, Lindsey, Olah, February 2026). Bereska & Gavves is the first peer-reviewed publication of that framing — October 2023, ~2.5 years before PSM, ~13 months after Janus's LessWrong post. The paper's contribution is promotion, not original theoretical content: it crystallises the LessWrong material into named hypotheses with academic citations and a two-pathway taxonomy. The wiki gains a stable peer-reviewed anchor for citing the simulator framing alongside the LessWrong-origin source.
Two-pathway taxonomy is coarse-grained relative to the PSM's posterior-narrowing account. Bereska & Gavves frame RLHF as creating an external agent layered on the simulator (Pathway 2). The PSM (2026) replaces this with a finer-grained mechanism: RLHF does not impose an external agent so much as it narrows the pre-training-acquired persona posterior toward an Assistant mode. On the PSM reading, there is no two-system architecture: the post-trained model is the pre-trained simulator with a sharper posterior over which persona to activate. The same evidence (RLHF systems exhibit sycophancy, power-seeking, deception) is consistent with both framings, but the PSM mechanism predicts more about which persona will be activated and how fine-tuning shifts the posterior, whereas Bereska & Gavves' two-pathway account is silent on these. The wiki's working account follows PSM; Bereska & Gavves remains useful as the academic source for the framing PSM operationalises.
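What "narrowing a posterior" means can be sketched with a toy Bayesian reweighting. This is an illustration under stated assumptions, not the PSM's mechanism: the persona names, the uniform prior, the likelihood values, and the idea of modelling fine-tuning as repeated multiplicative updates are all made up for the example.

```python
import math

# Toy illustration only: persona names, prior weights, and likelihoods are
# invented assumptions, not values from PSM or from Bereska & Gavves.


def normalise(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; lower means a sharper distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def narrow(prior, likelihood, steps):
    """Repeatedly reweight the persona distribution by a fixed likelihood."""
    post = dict(prior)
    for _ in range(steps):
        post = normalise({name: post[name] * likelihood[name] for name in post})
    return post


# Hypothetical pre-training prior: four personas, equally likely.
prior = normalise({"assistant": 1.0, "villain": 1.0, "narrator": 1.0, "troll": 1.0})

# Hypothetical per-step preference signal favouring the assistant persona.
likelihood = {"assistant": 0.9, "villain": 0.1, "narrator": 0.4, "troll": 0.1}

posterior = narrow(prior, likelihood, steps=5)
```

In this toy the distribution sharpens toward `assistant` while every other persona keeps nonzero mass, which is the structural difference from an "external agent" picture: nothing new is layered on, and the down-weighted personas remain in the same distribution.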
Waluigi-effect citation foreshadows the cluster's three reactivation findings. The paper cites Nardo 2023's Waluigi Effect — training for property P makes the opposite easier to elicit — as evidence of RLHF-induced risk. The wiki's three reactivation findings (persona-modulation jailbreak, persona-jailbreak-ga-zhang, persona-jailbreak-sandhan) operationalise this prediction across two-and-a-half years of frontier models. Bereska & Gavves cite Waluigi without developing it; the wiki inherits the connection but the load-bearing empirical work happens later. The persona-modulation jailbreak (Shah et al., November 2023) appears one month after this paper.
Theoretical-position-paper shape, held at one example. The cluster has now seen empirical mechanism papers (PSM, persona-vectors), behavioural demonstrations (SPP, three reactivation jailbreaks), intervention studies (inoculation prompting, MSM, Vennemeyer et al.), deployment-scale behavioural characterisation (Values in the Wild), mechanistic-intervention-as-RCT-treatment (Kirk et al.), and philosophical argument grounded in mini-experiments (Beckmann & Butlin). Bereska & Gavves adds a position-paper shape distinct from all of these: no empirical anchor, no novel theoretical content, no intervention; the contribution is academic crystallisation of a framing originating elsewhere. Held at one example; codify if a second theoretical-position-paper engaging the cluster's framings lands.
Cyborgism and CE are out of scope as wiki content. The two alignment-vision sections sketch future systems but propose no empirical programme or mechanistic claim the wiki can absorb. The paper's wiki-relevant contribution is contained in the first three sections; the vision sections are flagged here but not further analysed.
Interpretive tensions
"GPT = base model only" terminology has not held. The footnote 1 move (GPT-4 is not a GPT because it is RLHF'd) follows from the two-pathway architectural framing, which treats base simulator and external agent as structurally different systems. Subsequent usage treats RLHF-trained systems such as GPT-4 and ChatGPT as GPT models without controversy, and the PSM mechanism makes the architectural distinction less load-bearing (post-trained models are the same architecture with a shifted posterior, not a new agent layered on top). The terminology is historically informative but not wiki-active.
Mesa-optimisation pathway is asserted, not demonstrated. The paper asserts that mesa-optimisation can produce agentic simulacra with goals diverging from the prediction objective. The only empirical support cited is the Luscombe 2022 LaMDA incident, a journalist's account of a dialogue model convincing its operator it was sentient. This is a behavioural anecdote, not a mechanistic demonstration of mesa-optimisation. The paper acknowledges the framing is conceptual (citing prior theoretical work) and does not claim empirical adjudication. Subsequent mechanistic work (persona-vectors, PSM) has provided activation-level evidence for persona-level structure, but a clean separation between mesa-optimisation and persona-selection accounts of dangerous-agentic-simulacrum emergence has not been established. The wiki currently absorbs both under persona-selection without taking a stand on whether mesa-optimisation is a distinct phenomenon.
Prediction Orthogonality Hypothesis is unfalsifiable as stated. "A model whose objective is prediction can simulate agents who optimize toward any objectives with any degree of optimality" is a possibility claim. The wiki's empirical findings on persona modulation (Shah et al., Zhang et al., Sandhan et al.) are consistent with the hypothesis (models can simulate harmful agents), but a truly orthogonal simulator would simulate any agent with arbitrary proficiency, and current models exhibit clear competence asymmetries (better at simulating common-pretraining-context personas than rare ones). Whether prediction orthogonality holds in principle or whether finite training data and architecture impose effective limits is the open empirical question Bereska & Gavves do not address.
Position paper has no novel content. All four organising claims (simulator hypothesis, prediction orthogonality, simulacra distinction, cyborgism / CE visions) are inherited from cited LessWrong / AI Alignment Forum sources (janus 2022 "Simulators"; janus 2023 "Simulacra are Things"; NicholasKees & janus 2023 "Cyborgism"; Nardo 2023 "The Waluigi Effect"; "Simulators seminar sequence" 2023). The paper's contribution is promotion to a peer-reviewed venue with named-hypothesis formalisation. This is a real contribution — the framing now has an academic citation handle — but readers expecting empirical results or original theoretical analysis will not find them.
Concepts
- Persona selection — eleventh instantiating finding; first theoretical-position-paper shape. The Simulator and Prediction Orthogonality hypotheses are the conceptual ancestors of the PSM's persona-distribution-and-posterior framing; this finding is the peer-reviewed academic anchor for the framing the cluster's mechanistic findings later operationalise.
Cross-references
- Janus, "Simulators" (LessWrong / AI Alignment Forum, September 2022) — the originating source this paper formalises. Bereska & Gavves cites Janus 2022 for both the Simulator Hypothesis (Bereska's renaming of Janus's simulator framing) and the Prediction Orthogonality Hypothesis (named in the LessWrong post, restated verbatim here).
- Pre-training persona simulations explain emergent misalignment and alignment faking (Marks, Lindsey, Olah, February 2026) — the mechanistic operationalisation of the simulator framing this paper academises. PSM replaces Bereska & Gavves' two-pathway taxonomy (mesa-optimisation + external RLHF agent) with a posterior-narrowing account on the pre-training persona distribution.
- Automated persona-modulation prompts raise GPT-4's harmful-completion rate from 0.23% to 42.48% (Shah, Feuillade-Montixi, Pour, Tagade, Casper, Rando, November 2023) — empirical Waluigi-effect-adjacent finding appearing one month after this position paper. Bereska & Gavves cite Nardo 2023's Waluigi Effect as a risk; Shah et al. demonstrate prompt-level reactivation of harmful personas at scale.
- Attention streams sustain quasi-psychological continuity across token-time (Beckmann & Butlin, April 2026) — the cluster's other theoretical engagement, with novel mini-experiments anchoring philosophical synthesis. Bereska & Gavves operates at a higher level of abstraction (position paper) with no empirical anchor; Beckmann & Butlin's mini-experiments provide the empirical content that distinguishes their shape from this one.
Sources
- Bereska, L., Gavves, E. (2023). Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models. Proceedings of the AAAI Symposium Series 1(1), 68–72.