Summary
Russell, Rajendhran, Pham, Iyyer (University of Maryland), Wieting
(Google DeepMind), arXiv 2604.03136 v1 April 3 2026 (v4 April 13).
Sixty-ninth finding. Twentieth instantiation of
concepts/persona-selection —
and a cross-model behavioral characterization that extends the
deployment-scale-behavioral-characterization shape pioneered by
Huang et al.'s values-in-the-wild
from single-model to a comparative five-model setting (held at one
example for the cross-model variant; codify when a second cross-model
behavioral fingerprint paper lands).
The paper's headline contribution is to AI authorship detection — it shows that discourse-level narrative features, not stylistic surface, carry essentially all of the human-vs-AI separability signal (93.2% macro-F1 narrative-only retains 97% of the with-style baseline). The wiki absorbs it for the secondary contribution: an empirical map of how five frontier LLMs differ from each other and from humans at the level of narrative output. Two structural results load-bear. (1) The AI narrative attractor. All five LLMs occupy overlapping regions of narrative-feature space, separated from human-authored stories by a margin larger than any AI-to-AI margin: mean human-AI centroid distance is 1.6× mean AI-AI centroid distance (6.6 vs. 4.3), and the closest human-AI centroid pair (6.2) exceeds the most distant AI-AI pair (6.0). Human stories are also more dispersed (mean radius 33.2 vs. 27.4) and rarer (mean rarity percentile 0.71 vs. 0.49; Cohen's d = 0.83; 24.7% of human stories fall in the rarest decile corpus-wide vs. 7.1% of AI stories). (2) Per-model fingerprints. Each frontier LLM has a measurably distinct narrative signature on top of the shared AI region. Claude is the most distinctive (per-source F1 77.1% narrative-only): event intensity escalates less than in any other source, narrative voice is the most uniform, and 62% of Claude stories honor literary conventions vs. 39–56% across other sources, with epilogues favored and dream sequences avoided. GPT (73.0%) centers on socially-oriented storytelling: gossip-and-rumor as plot mechanism (64% vs. 44–55%), retrospective framing on past events, ensemble social networks matching human levels, and subverts expectations more than other AI (41% vs. 27–36%). Gemini, DeepSeek, and Kimi form a more confused cluster (55.2–59.6% F1); each still has individual quirks (DeepSeek front-loads context; Gemini produces the tidiest endings, extended denouements, and the bleakest settings — 88% tagged bleak/oppressive; Kimi has the fewest fingerprints and sits at the generic center of the AI distribution).
The finding's structural contribution to the wiki is the first deployment-scale cross-model output-level evidence of LLM convergence into a shared behavioral region distinct from humans — the output-side counterpart to Lu et al.'s cross-model Assistant Axis (which shows that internally, three open-source instruct models share a PC1 with role-loading correlation
0.92), and to Beckmann & Butlin's persona-regions hypothesis (which posits basins-of-attraction in persona space that should produce coherent reidentifiable output regimes). StoryScope supplies the behavioral prediction-and-confirmation for that hypothesis at the output side, across five proprietary frontier models the activation-level literature has not yet probed.
Method
Corpus. 10,272 human-authored short stories from Books3 (Presser 2020, an overwhelmingly 20th-century literary-fiction corpus). For each human story, Gemini 2.5 Flash reverse-engineers an underlying writing prompt by inferring the premise (Li et al. 2024). Each of five LLMs generates a story from the same prompt: Claude Sonnet 4.6 (Anthropic 2026), GPT 5.4 (OpenAI 2026), Gemini 3 Flash (Gemini Team 2025b), DeepSeek V3.2 (DeepSeek-AI 2025), Kimi K2.5 (Team 2026). Result: 61,608 stories at ~5,000 words each (mean 4,753), six parallel versions per prompt — one human, five AI. Authors release the prompts and AI stories; human Books3 stories not released for copyright reasons.
The STORYSCOPE pipeline. Three stages convert prose into discourse-level features without relying on surface style. (i) Structured intermediate representations. Each story is converted to a template-based representation across 10 narrative dimensions (plot, agents, temporal structure, etc.), forcing downstream stages to reason over narrative content rather than wording. (ii) Cross-source comparative analysis. For each of 600 prompts (the comparison set), an LLM judge reads the six paired representations and distills cross-source observations, compressing ~2.7M tokens of raw text into ~686K tokens of comparative observations. (iii) Discourse-level feature discovery. From the observations, candidate features are extracted, embedded with F2LLM-4B, clustered at cosine similarity threshold, and merged (65 clusters merged), yielding d = 304 discourse-level features. Each feature is encoded per-story by LLM-as-judge scoring, producing a 61,608 × 304 dense matrix. The pipeline is guided by the NarraBench taxonomy of Hamilton et al. 2025.
Feature subsets. From the 304-feature space the paper selects two operationally distinct subsets. 30 core features are cross-source-discriminative, scored as AI-leaning or human-leaning by the direction of the human-AI gap. 75 fingerprint features are source-specific — each identified from the gap between one model and the other four AI sources. Detection and attribution tasks are evaluated on (a) style-only (39 features); (b) core narrative only (30); (c) core + fingerprint (101); (d) all four feature groups including stylistic.
Tasks. Binary detection (human vs. AI) and six-way authorship attribution (human + 5 LLMs), evaluated on a held-out test split by macro-F1.
Key results
Detection at 93.2% macro-F1 narrative-only. Discourse-level features alone reach 91.1% with core+fingerprint and 93.2% with all narrative groups; the with-style baseline reaches 95.6%. Narrative features therefore carry over 97% of the detection signal that style adds nothing to. This is the paper's strongest claim about durability of the human-AI gap: stylistic surface can be edited or swapped between model versions, but narrative structure is a deeper property of how the model assembles a story and is harder to alter post-hoc.
Six-way attribution at 68.4% macro-F1 narrative-only / 78.7% with style. The attribution task is much harder than detection, because the five AI models overlap substantially. Per-source F1 (narrative-only / with-style): Human 88.5% / 93.0%, Claude 77.1% / 89.3%, GPT 73.0% / 82.1%, DeepSeek 59.6% / 66.0%, Gemini 55.2% / 65.8%, Kimi 55.2% / 66.8%. Human and Claude remain separable from narrative alone; GPT is intermediate; DeepSeek/Gemini/Kimi form a confused triplet.
The AI cluster vs. human dispersion. In the z-scored encoded feature space with Euclidean distance: mean human-AI centroid distance is 1.6× mean AI-AI centroid distance (6.6 vs. 4.3); the closest human-AI centroid pair (6.2) is farther than the most distant AI-AI pair (6.0). Human stories are 22% more dispersed than the average AI radius (mean radius 33.2 vs. 27.4), and their median 10-nearest-neighbor radius is 1.13× larger (33.1 vs. 29.2). The six most-confused source pairs in the narrative-only 6-way classifier are exclusively AI↔AI; the largest pair is gemini↔deepseek (222 and 207 misclassifications), dwarfing the most common human misclassification (human→kimi, 46).
Human stories are rarer. Per-story rarity (mean Euclidean distance to 25 nearest neighbors): humans 0.71 vs AI 0.49 mean percentile, Cohen's d = 0.83. 24.7% of human stories fall in the rarest 10% corpus-wide vs 7.1% of AI stories; in the top 1%, 3.0% human vs 0.6% AI. At the prompt level (six paired versions of one prompt), the human story is the rarest of the six 57.8% of the time (vs 16.7% chance).
Core narrative differences. The 30 core narrative features characterize the human-vs-AI gap. AI stories are systematically more thematically explicit (over-explain themes), causally tidy (single-track plots that resolve cleanly), and temporally linear (few flashbacks, little nonlinear structure). Human stories show more morally ambiguous protagonist choices, more temporal complexity, and greater structural diversity. The paper does not adjudicate whether these differences reflect training-data regularities, RLHF preference patterns, or architectural priors.
Per-model fingerprints (§5). Each model has individual quirks that distinguish it within the AI cluster.
- Claude keeps it cool. Most distinctive AI (77.1% / 89.3% F1). Event intensity escalates less than in any other source. Narrative voice is the most uniform. Takes a reverent/continuist approach to literary tradition (62% of Claude stories vs. 39–56% across other sources). Favors epilogues, avoids dream sequences, produces "quiet endings over avalanche endings."
- GPT likes to gossip. Gossip-and-rumor as plot mechanism (64% vs. 44–55%). Frames stories as reflections on events from years or decades ago. Ensemble-heavy social networks matching human levels. Subverts expectations more than other AI (41% vs. 27–36%), leaves reconciliations ambiguous.
- Gemini, DeepSeek, and Kimi as triplets. The more-confused cluster (55.2–59.6% F1 narrative-only). Each still has individual quirks. DeepSeek front-loads crucial context other sources leave until later. Gemini produces the tidiest endings, extended denouements, and the bleakest settings (88% tagged bleak/oppressive). Kimi has the fewest fingerprints and lowest F1, sitting at the generic center of the AI distribution with no distinctive narrative choices.
Style adds little. Adding stylistic features (39 features: word choice, syntactic structure, lexical statistics) raises detection F1 from 91.1% to 95.6% (relative gain 5%) and attribution F1 from 58.5% to 78.7% — most of the attribution gain accrues to the triplet's per-source separability (Gemini 55.2 → 65.8, etc.). Core narrative + fingerprint features dominate the headline signal; stylistic cues are complementary but not load-bearing for human-vs-AI separability.
Why it matters
First deployment-scale cross-model output-level convergence finding for the wiki. The persona-selection cluster's behavioral evidence base before StoryScope was either single-model (values-in-the-wild on Claude; persona-jailbreak findings on mixed sets) or activation-level (persona-vectors, Assistant Axis, em-easy-soligo). StoryScope is the first behavioral large-scale evidence that frontier LLMs from five different vendors converge into a shared region of output space distinguishable from humans — the output-side analog of the cross-model PC1 correlation (> 0.92) Lu et al. find at the activation level. The two findings load-bear in opposite directions of the same scaffold: Lu shows the latent geometry is shared; Russell shows the surface behavior is shared. Neither isolates the mechanism (training data, RLHF, architecture, or some combination), and StoryScope deliberately does not try to.
Empirical anchor for an "AI narrative attractor" framing. Lu et al.'s "Assistant attractor" describes a region within a single model that the model returns to under unconstrained conversational pressure; Beckmann & Butlin's persona regions hypothesis (their Hypothesis 3) posits these attractors as basins of attraction in persona space, of which they identify three candidates (assistant, evil, Aura). StoryScope's AI cluster is structurally a different phenomenon — cross-model convergence at the output level, not within-model end-state convergence — but it is the kind of behavioral pattern Hypothesis 3 would predict if a strong "narrative-default Assistant" region existed across vendor model families. Whether this region maps mechanistically to Lu et al.'s assistant pole, or whether the convergence has additional sources (e.g., shared RLHF preference distributions across vendors trained on overlapping human-feedback distributions), is unresolved. The five-model fingerprint result sharpens the question rather than answering it: if all five LLMs were trained on similar corpora with similar RLHF, why would they end up with measurably distinct narrative signatures rather than collapsing onto a single point? The data is consistent with shared-region-with-vendor-specific-perturbation, which is closer to the persona-regions framing than to a single-attractor model.
Per-model fingerprints anchor the persona-region hypothesis at the output level across vendor families. The persona-selection literature's activation-level evidence is overwhelmingly on open-weights models (Gemma, Qwen, Llama families). StoryScope is the first cluster finding to demonstrate behavioral signatures across five proprietary frontier models — Claude, GPT, Gemini, DeepSeek, Kimi — at a scale (61,608 stories, 304 features) where each model's narrative defaults are statistically identifiable. The activation correlate is unmeasurable (no access), but the behavioral side is characterizable, and the result that Claude is the most distinctive is a non-trivial output-level claim that the persona-vector literature's mostly-Gemma/Qwen/Llama coverage cannot make.
Open question about whether the AI cluster is a corpus artifact. The human reference is Books3 — overwhelmingly 20th-century literary fiction. AI models are generating fresh stories from prompts reverse- engineered from those originals. The paper acknowledges in its Ethics Statement that the human distribution is not contemporary- prose-matched to the LLMs' training prior distribution. This raises the question whether the "AI cluster" partly reflects contemporary-prose conventions (themes explicit, plots tidy) rather than something intrinsic to LLM generation — a human-vs-Books3- specifically gap rather than a human-vs-AI gap. The inter-AI convergence (centroid distance 4.3, 1.6× smaller than human-AI gap; six most-confused pairs all AI↔AI) is the part of the finding least vulnerable to this confound: the AI models converge with each other independent of the human reference. The per-model fingerprint result is similarly robust: differences between Claude and GPT do not depend on which human distribution is the reference.
interpretive tensions
Detection-paper framing vs. model-psychology reading. The paper's intended contribution is to AI authorship detection ("our features provide a measurable proxy for narrative uniqueness that complements existing detection tools"). The wiki absorbs it as evidence of behavioral structure in LLM outputs. The two readings are compatible but emphasize different load-bearing claims: detection emphasizes the 93.2% F1 durability against post-hoc surface editing; the wiki emphasizes the per-model fingerprints and the AI-cluster geometry as evidence of behavioral-persona structure. The paper's data supports both readings; the wiki's framing does not require detection to succeed (the per-model fingerprints are informative even if detection were lower).
Behavioral output cannot directly bridge to the activation-level cluster. Persona-vectors, Assistant Axis, persona-selection-model, and the persona-regions framework are activation-level constructs that require white-box model access. StoryScope is entirely black-box — feature extraction proceeds via LLM-as-judge on outputs. The behavioral findings are consistent with the activation-level literature (per-model fingerprints could be activation-level persona-region differences manifesting at the output), but no mechanistic claim follows from the behavioral data alone. The finding is in this sense a complementary rather than a confirmatory contribution to the activation-level cluster.
Convergence source is underdetermined. Five vendors with different training pipelines, different RLHF preference data, different architectures (DeepSeek and Kimi are MoE; Claude/GPT/Gemini proprietary), and (presumably) different training corpora produce narrative outputs in overlapping regions of feature space. Four candidate mechanisms — shared pre-training corpora regularities (e.g., Common Crawl), shared RLHF preference distributions (raters preferring certain narrative styles), shared evaluation pressures (common benchmarks reward certain story shapes), shared architecture priors — are all plausible and not adjudicated. The paper does not take a position; the wiki marks this as a major open question for the persona-selection cluster that the StoryScope data cannot resolve.
LLM-as-judge dependence at every stage. Story-to-template conversion, comparative analysis, feature scoring, and core/ fingerprint selection all rely on LLM judges (Gemini 2.5 Flash for prompt extraction; per-feature LLM scoring; LLM-driven feature discovery). The judges may themselves embed narrative-style biases that correlate with detected fingerprints. The 99.6% inter-rater agreement reported across feature encodings (§D) constrains but does not eliminate this — the judges agree because they share priors; those shared priors may also drive which features get discovered as discriminative in the first place. This is a known limitation of LLM-judge-based comparative pipelines that the paper acknowledges but does not fully address.
Books3 reference confound (partially limited). Discussed above under "Open question": the human reference is 20th-century literary fiction, not contemporary prose. The AI cluster's overlap with itself is not threatened by this confound; the AI-vs-human margin size is.
concepts
- Persona selection — twentieth instantiating finding; cross-model deployment-scale behavioral characterization extending the values-in-the-wild shape from single-model to five-model comparative. Two contributions to the concept: (1) the AI-narrative-cluster result supplies the first large-scale behavioral evidence of cross-vendor output convergence into a region distinct from human-authored text, paralleling the cross-model Assistant-Axis correlation > 0.92 that Lu et al. found at the activation level on three open-source models, now extended to five proprietary frontier models at the output level; (2) the per-model fingerprint result (Claude restraint and reverence, GPT gossip and expectation-subversion, Gemini bleak tidiness, DeepSeek context-frontloading, Kimi generic-center) is the cluster's first empirical map of narrative-default persona-region differences across five vendor model families, anchoring Beckmann & Butlin's persona-regions hypothesis with cross-model behavioral evidence at a scale (61,608 stories) that smaller activation-level studies cannot match.
cross-references
- The Assistant Axis (Lu, Gallagher, Michala, Fish, Lindsey, January 2026) — the activation-level cross-model counterpart. Lu et al. show that on three open-source instruct models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B), the PC1 of persona space correlates > 0.92 across models, evidence that the latent persona geometry is shared. Russell et al. show that on five proprietary frontier models the output narrative distribution is shared into a tight cluster distinct from humans. Together: the activation-level and output-level evidence both support a cross-model convergence story; neither alone is decisive.
- Where is the Mind? (Beckmann & Butlin, April 2026) — persona-regions hypothesis (basins of attraction in persona space) predicts that coherent reidentifiable output regimes should be empirically discoverable. StoryScope's per-model fingerprints are exactly the kind of behavioral signatures that persona regions would produce at the output level. The wiki holds the regions-hypothesis-as-output-prediction reading as one of several available readings of the StoryScope data; the fingerprints could equivalently reflect training-data differences or RLHF-preference differences without recourse to persona geometry.
- 308,210 deployment Claude conversations yield 3,307 distinct AI values dominated by five service-oriented terms (Huang et al., April 2025) — the deployment-scale-behavioral- characterization shape this finding extends to multi-model comparative. Huang et al. characterize one model family (Claude) at scale; Russell et al. characterize five frontier vendors comparatively at slightly smaller per-model scale. Both rely on LLM-as-judge for taxonomic extraction; both produce hierarchical feature spaces; both treat the result as a behavioral fingerprint of trained model dispositions.
- BiPO steering vector on Llama-3.1-70B used as continuous RCT treatment (Kirk et al., December 2025) — operates at the human-population level (RCT on 2,028 UK adults exposed to steered relationship-seeking variants), measuring outcomes in human psychology. StoryScope operates at the AI-output level (61,608 generated stories), measuring signatures in AI narrative behavior. Methodologically distinct but structurally complementary: both characterize behaviorally what activation-level persona-vector work captures internally.
- Persona vectors monitor and control character trait drift via linear directions in the residual stream (Chen et al., July 2025) — the activation-level methodology that predicts the kind of behavioral signatures StoryScope discovers empirically. Persona vectors are residual-stream directions that causally produce trait expression in outputs; StoryScope's fingerprints are the trait-expression side without direct activation access. No bridging experiment is run; the connection is structural rather than empirical.
- Janus, "Simulators" (September 2022) — the simulator/simulacra framing under which per-model fingerprints would be properties of which simulacrum region each base+post-training pipeline tends to draw from when asked to generate fiction. StoryScope's data is consistent with the framing (each model has a distinctive "default simulacrum" for fiction generation) without requiring it (the same data is consistent with non-simulator readings of persona structure).
sources
- Russell, J., Rajendhran, R., Pham, C. M., Iyyer, M., Wieting, J. (2026). STORYSCOPE: Investigating Idiosyncrasies in AI Fiction. arXiv:2604.03136.