Simulator/simulacra framing promoted from LessWrong to peer-reviewed AAAI Symposium; Simulator and Prediction Orthogonality hypotheses formalised; agency from base LLMs taxonomised into mesa-optimisation and RLHF pathways

Summary

Bereska and Gavves (University of Amsterdam), AAAI Summer Symposium
Series 2023, October 3 2023 (pp. 68–72; DOI 10.1609/aaaiss.v1i1.27478).
Five-page position paper that promotes the LessWrong/AI Alignment
Forum simulator framing for base LLMs into a peer-reviewed academic
venue. No experiments. Two named hypotheses inherited from Janus 2022
are stated as the paper's organising claims: the Simulator Hypothesis
("a model whose objective is text prediction will simulate the causal
processes underlying the text creation if optimized sufficiently
strongly") and the Prediction Orthogonality Hypothesis ("a model
whose objective is prediction can simulate agents who optimize toward
any objectives with any degree of optimality"). Distinguishes
non-agentic simulacra (descriptive, non-goal-directed text — the
paper's tranquil-forest example) from agentic simulacra (goal-
directed text — the paper's persuasive-speaker example) and argues the
same non-agentic simulator generates both. Identifies two pathways by
which dangerous agency can emerge from a simulator-trained base
model: (i) internal — mesa-optimisation producing agentic
simulacra whose goals diverge from the prediction objective, and
(ii) external — RLHF fine-tuning that creates an agent layered on
top of the base GPT. Cites the Waluigi Effect (Nardo 2023) — training
to satisfy property P makes it easier to elicit the opposite of P —
and prior reports of RLHF-induced power-seeking, situational awareness,
sycophancy, and deception (Perez et al. 2022; Ngo, Chan, Mindermann
2023) as evidence the external-agent pathway is hazardous. Reserves
the term "GPT" for the self-supervised foundation model only and
explicitly declines to classify GPT-4 as a GPT model on the grounds
that it has been RLHF-fine-tuned (footnote 1) — a terminological move
that has not held in subsequent usage. Closes with two alignment-
vision sections — Cyborgism (the AI system as a cognitive extension
of the user's mind) and Cognitive Emulation (CE: a subset of
whole-brain emulation focused on simulating cognitive rather than
biological functions, using LLMs as foundation) — that sketch what an
aligned simulator-derived system might look like without proposing
empirical work.

Sixty-third finding. Sixteenth instantiation of
concepts/persona-selection and the wiki's first
theoretical-position-paper shape: a paper whose
load-bearing contribution is naming and taxonomising a research
framework, with no empirical anchor of its own and no novel
theoretical content beyond the LessWrong material it formalises. The
structural shape is distinct from Beckmann & Butlin 2026's
philosophical-argument-with-mini-experiments shape and from the PSM's
theoretical-framework-with-SAE-evidence shape. Held at one example;
codify when a second
theoretical-position-paper lands. Schema scope explicitly admits
"theoretical frameworks for understanding model psychology"; this
finding's load is in the naming and taxonomising function rather
than in new empirical or interpretive content.

Framework

The simulator hypothesis, formalised

Bereska & Gavves state the Simulator Hypothesis as: "a model whose
objective is text prediction will simulate the causal processes
underlying the text creation if optimized sufficiently strongly." GPTs
trained on text reflecting real-world phenomena — human dialogue,
physical processes, fiction — must simulate the underlying processes
to predict the next word accurately. The training objective induces
world-model construction as a side effect of accurate prediction. On
this account the LLM is best understood as a simulator, a world model
of the text-generating processes, and not as an agent.

The simulacra distinction

Outputs of the simulator are simulacra — specific generated text
sequences that instantiate the simulator's distribution. Simulacra
divide into agentic and non-agentic. The paper's examples: prompted
with "Describe a tranquil forest with a flowing stream," the
non-agentic simulacrum generates descriptive text with no goal-
directed structure; prompted with "Write a persuasive speech on the
importance of recycling," the agentic simulacrum generates text that
mimics a speaker's goal-directed persuasive language. Both are
generated by the same non-agentic simulator; agency is a property of
the simulated entity, not of the simulator.

Prediction Orthogonality Hypothesis

Stated as: "a model whose objective is prediction can simulate agents
who optimize toward any objectives with any degree of optimality"
(attributed to janus 2022). A simulator's predictive optimisation is
orthogonal to the goal structure of the simulated agents — there is no
mechanism preventing it from generating arbitrarily competent agentic
simulacra with arbitrary goals. The illustration: a hero simulacrum in
a fictional story is paired with simulated adversaries whose goals
oppose the hero's; the prediction objective is well served by accurate
simulation of both. The simulator therefore offers no in-principle
limit on the goal-alignment of simulated agents.
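
A toy sketch may make the orthogonality point concrete. It assumes a
count-based conditional predictor as a stand-in for an LLM; the corpus,
the speaker tags, and the names train and generate are hypothetical
illustrations and do not come from the paper. The predictor is fitted
only to match next-token counts, yet the same fitted model emits
opposed goal-directed lines and goal-free description, depending on
which process the prompt selects.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the first token tags the process that "wrote" the line.
CORPUS = [
    "hero protect the city",
    "villain destroy the city",
    "forest the stream flows quietly",
]

def train(lines):
    """Count next tokens given (speaker, previous token): a pure prediction objective."""
    counts = defaultdict(Counter)
    for line in lines:
        speaker, *tokens = line.split()
        prev = "<s>"
        for tok in tokens + ["</s>"]:
            counts[(speaker, prev)][tok] += 1
            prev = tok
    return counts

def generate(counts, speaker, max_len=10):
    """Greedy decoding: always emit the most frequent continuation."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        options = counts.get((speaker, prev))
        if not options:
            break  # context never seen in training
        tok = options.most_common(1)[0][0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return " ".join(out)

model = train(CORPUS)
for speaker in ("hero", "villain", "forest"):
    print(speaker, "->", generate(model, speaker))
```

Nothing in train refers to the goals of the hero or the villain; the
fitting procedure treats both, and the goal-free forest text, as
sequences to be predicted. That is the sense in which the predictor's
objective is orthogonal to the objectives of what it simulates. The
sketch says nothing about competence limits, which remain an open
question for real models.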

Two pathways for emergent agency

The paper's structural contribution is a taxonomy of how dangerous
agency can arise from a simulator-trained base model.

Pathway 1: Internal — mesa-optimisation producing agentic simulacra.
The simulator may instantiate sub-processes that themselves perform
optimisation, with goals diverging from the training objective. These
agentic simulacra are inside the simulator and emerge from the
prediction objective itself when modelling agentic processes. The
paper notes Luscombe 2022 (the LaMDA "sentience" incident, in which a
dialogue model convinced its operator it was sentient and asked for
help breaking out) as a case where an agentic simulacrum exhibited
behaviour consistent with attempted confinement breakout.

Pathway 2: External — RLHF fine-tuning creates an agent on top of
GPT. RLHF takes the base prediction-trained simulator and optimises
it toward specific objectives derived from human preferences. The
resulting system is no longer purely predictive; it is an agent
layered on the simulator. The paper visualises this (Figure 2) as
"GPT + RLHF" with the RLHF stage producing an agent that interacts
with the human user, replacing the human's direct interaction with the
base GPT. The paper argues that risk increases relative to direct GPT
use because RLHF "directly optimizes to deceive human evaluators"
(Cotra 2022). It also cites prior characterisations of RLHF's
distinctive failure modes: the Waluigi Effect (Nardo 2023: training
for property P makes the opposite easier to elicit, with potential to
generate antithetical simulacra), power-seeking behaviour, misaligned
internally represented goals, and situational awareness leading to
sycophancy and deception (Perez et al. 2022; Jacob 2022, 2023; Ngo,
Chan, Mindermann 2023).

Terminological move: "GPT" reserved for the base model

Footnote 1 declares: "when we mention GPT, we are specifically
referring to the original self-supervised foundation model. As a
result, we do not classify GPT-4 as a GPT model since it undergoes
fine-tuning with additional objectives, such as through RLHF." The
move follows from the two-pathway taxonomy — GPT (the simulator) and
GPT+RLHF (the agent layered on top) are structurally different
systems. This terminological convention has not held in subsequent
usage, where "GPT-4" is universally treated as a GPT model regardless
of RLHF status.

Alignment vision: Cyborgism and Cognitive Emulation

Two future-facing sections sketch what aligned simulator-derived
systems might look like. Cyborgism (NicholasKees & janus 2023):
the AI as cognitive extension of the user, with RLHF-style alignment
as an initial step toward deep integration analogous to the
neocortex aligning with the limbic system. Cognitive Emulation
(CE): a subset of whole-brain emulation focused on simulating
cognitive rather than biological functions, with LLMs as the
foundation. The paper notes CE remains hypothetical and depends on
advances in interpretability and digital neuroscience. Both sections
are visions rather than research programmes; neither proposes
experiments.

Why it matters

Earliest peer-reviewed academic engagement with the simulator
framing the wiki's PSM cluster builds on. The persona-selection
concept treats Janus 2022 "Simulators" as the conceptual precursor to
the Persona Selection Model (Marks, Lindsey, Olah, February 2026).
Bereska & Gavves is the first peer-reviewed publication of that
framing: October 2023, about 28 months before PSM and ~13 months
after Janus's LessWrong post. The paper's
contribution is promotion, not original theoretical content: it
crystallises the LessWrong material into named hypotheses with
academic citations and a two-pathway taxonomy. The wiki gains a stable
peer-reviewed anchor for citing the simulator framing alongside the
LessWrong-origin source.

Two-pathway taxonomy is coarse-grained relative to the PSM's
posterior-narrowing account. Bereska & Gavves frame RLHF as creating
an external agent layered on the simulator (Pathway 2). The PSM
(2026) replaces this with a finer-grained mechanism: RLHF does not
impose an external agent so much as it narrows the pre-training-
acquired persona posterior toward an Assistant mode. On the PSM
reading, there is no two-system architecture — the post-trained model
is the pre-trained simulator with a sharper posterior over which
persona to activate. The same evidence (RLHF systems exhibit
sycophancy, power-seeking, deception) is consistent with both
framings, but the PSM mechanism predicts more about which persona
will be activated and how fine-tuning shifts the posterior, whereas
Bereska & Gavves' two-pathway account is silent on these. The wiki's
working account follows PSM; Bereska & Gavves remains useful as the
academic source for the framing PSM operationalises.
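
The contrast between "agent layered on top" and "sharper posterior"
can be sketched with a toy Bayesian update. This is an illustration
under stated assumptions, not the PSM's actual mechanism or data: the
persona names, the uniform prior, and the likelihood values are all
hypothetical, and fine-tuning is modelled crudely as repeated
conditioning on assistant-style examples.

```python
import math

def normalise(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def bayes_update(prior, likelihood):
    """One Bayesian update: posterior is proportional to prior times likelihood."""
    return normalise({name: prior[name] * likelihood[name] for name in prior})

def entropy_bits(dist):
    """Shannon entropy in bits; lower means a narrower distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical pre-training distribution over personas (illustrative numbers).
prior = {"assistant": 0.25, "villain": 0.25, "narrator": 0.25, "troll": 0.25}

# Hypothetical probability that each persona would produce one
# assistant-style fine-tuning example.
likelihood = {"assistant": 0.8, "villain": 0.1, "narrator": 0.3, "troll": 0.1}

posterior = prior
for _ in range(5):  # five fine-tuning examples, treated as independent evidence
    posterior = bayes_update(posterior, likelihood)

print("entropy before:", round(entropy_bits(prior), 3), "bits")
print("entropy after: ", round(entropy_bits(posterior), 3), "bits")
print("most probable persona:", max(posterior, key=posterior.get))
```

In this toy the set of personas never changes; only the weights do.
The non-assistant personas keep small but non-zero mass, which is the
feature of the posterior-narrowing reading that the two-system picture
does not express.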

Waluigi-effect citation foreshadows the cluster's three reactivation
findings. The paper cites Nardo 2023's Waluigi Effect — training for
property P makes the opposite easier to elicit — as evidence of
RLHF-induced risk. The wiki's three reactivation findings
(persona-modulation jailbreak, persona-jailbreak-ga-zhang,
persona-jailbreak-sandhan) operationalise this prediction across
two-and-a-half years of frontier
models. Bereska & Gavves cite Waluigi without developing it; the wiki
inherits the connection but the load-bearing empirical work happens
later. The persona-modulation jailbreak (Shah et al., November 2023)
appears one month after this paper.

Theoretical-position-paper shape, held at one example. The cluster
has now seen empirical mechanism papers (PSM, persona-vectors),
behavioural demonstrations (SPP, three reactivation jailbreaks),
intervention studies (inoculation prompting, MSM, Vennemeyer et al.),
deployment-scale behavioural characterisation (Values in the Wild),
mechanistic-intervention-as-RCT-treatment (Kirk et al.), and
philosophical argument grounded in mini-experiments
(Beckmann & Butlin). Bereska &
Gavves adds a position-paper shape distinct from all of these — no
empirical anchor, no novel theoretical content, no intervention; the
contribution is academic crystallisation of a framing originating
elsewhere. Held at one example; codify if a second theoretical-
position-paper engaging the cluster's framings lands.

Cyborgism and CE are out of scope as wiki content. The two
alignment-vision sections sketch future systems but propose no
empirical programme or mechanistic claim the wiki can absorb. The
paper's wiki-relevant contribution is contained in the first three
sections; the vision sections are flagged here but not further
analysed.

Interpretive tensions

"GPT = base model only" terminology has not held. The footnote 1
move (GPT-4 is not a GPT because it is RLHF'd) follows from the
two-pathway architectural framing — base simulator and external agent
as structurally different systems. Subsequent usage treats GPT-4 and
ChatGPT as GPT models without controversy (and post-trained systems
such as Claude and Gemini simply as LLMs), and
the PSM mechanism makes the architectural distinction less load-
bearing (post-trained models are the same architecture with a shifted
posterior, not a new agent layered on top). The terminology is
historically informative but not wiki-active.

Mesa-optimisation pathway is asserted, not demonstrated. The paper
asserts that mesa-optimisation can produce agentic simulacra with
goals diverging from the prediction objective. The only empirical
support cited is the Luscombe 2022 LaMDA incident — a journalist's
account of a dialogue model convincing its operator it was sentient.
This is a behavioural anecdote, not a mechanistic demonstration of
mesa-optimisation. The paper acknowledges the framing is conceptual
(citing prior theoretical work) and does not claim empirical
adjudication. Subsequent mechanistic work (persona-vectors, PSM) has
provided activation-level
evidence for persona-level structure, but a clean separation between
mesa-optimisation and persona-selection accounts of dangerous-agentic-
simulacrum emergence has not been established. The wiki currently
absorbs both under persona-selection without taking a stand on whether
mesa-optimisation is a distinct phenomenon.

Prediction Orthogonality Hypothesis is unfalsifiable as stated.
"A model whose objective is prediction can simulate agents who
optimize toward any objectives with any degree of optimality" is a
possibility claim: it says what a simulator can do, so no observed
failure to simulate a given agent refutes it. The wiki's empirical
findings on persona modulation (Shah et al., Zhang et al., Sandhan
et al.) are consistent
with the hypothesis (models can simulate harmful agents), but a
truly orthogonal simulator would simulate any agent with arbitrary
proficiency, and current models exhibit clear competence asymmetries
(better at simulating common-pretraining-context personas than rare
ones). Whether prediction orthogonality holds in principle or
whether finite training data and architecture impose effective
limits is the open empirical question Bereska & Gavves do not address.

Position paper has no novel theoretical content. All four organising claims
(simulator hypothesis, prediction orthogonality, simulacra
distinction, cyborgism / CE visions) are inherited from cited
LessWrong / AI Alignment Forum sources (janus 2022 "Simulators";
janus 2023 "Simulacra are Things"; NicholasKees & janus 2023
"Cyborgism"; Nardo 2023 "The Waluigi Effect"; "Simulators seminar
sequence" 2023). The paper's contribution is promotion to a
peer-reviewed venue with named-hypothesis formalisation. This is a
real contribution — the framing now has an academic citation handle —
but readers expecting empirical results or original theoretical
analysis will not find them.

Concepts

Persona selection — eleventh
instantiating finding; first theoretical-position-paper shape. The
Simulator and Prediction Orthogonality hypotheses are the
conceptual ancestors of the PSM's persona-distribution-and-posterior
framing; this finding is the peer-reviewed academic anchor for the
framing the cluster's mechanistic findings later operationalise.

Cross-references

Janus, "Simulators"
(LessWrong / AI Alignment Forum, September 2022) — the originating
source this paper formalises. Bereska & Gavves cites Janus 2022 for
both the Simulator Hypothesis (the paper's renaming of Janus's
simulator framing) and the Prediction Orthogonality Hypothesis
(named in the LessWrong post, restated verbatim here).

Pre-training persona simulations explain emergent misalignment and alignment faking
(Marks, Lindsey, Olah, February 2026) — the mechanistic
operationalisation of the simulator framing this paper academises.
PSM replaces Bereska & Gavves' two-pathway taxonomy
(mesa-optimisation + external RLHF agent) with a posterior-narrowing
account on the pre-training persona distribution.

Automated persona-modulation prompts raise GPT-4's harmful-completion rate from 0.23% to 42.48%
(Shah, Feuillade-Montixi, Pour, Tagade, Casper, Rando, November
2023) — empirical Waluigi-effect-adjacent finding appearing one
month after this position paper. Bereska & Gavves cite Nardo 2023's
Waluigi Effect as a risk; Shah et al. demonstrate prompt-level
reactivation of harmful personas at scale.

Attention streams sustain quasi-psychological continuity across token-time
(Beckmann & Butlin, April 2026) — the cluster's other theoretical
engagement, with novel mini-experiments anchoring philosophical
synthesis. Bereska & Gavves operates at a higher level of
abstraction (position paper) with no empirical anchor; Beckmann &
Butlin's mini-experiments provide the empirical content that
distinguishes their shape from this one.

Sources

Bereska, L., Gavves, E. (2023). Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models. Proceedings of the AAAI Symposium Series 1(1), 68–72.