Reasoning Models Generate Societies of Thought

Argues that reasoning-RL-trained models such as DeepSeek-R1 (671B) and
QwQ-32B do not simply produce longer chains of thought than their
instruction-tuned counterparts (DeepSeek-V3, Qwen-2.5-32B-IT): their
reasoning traces are dialogic, populated by question-answering,
perspective shifts, conflicts of perspectives, reconciliation,
Bales-IPA socio-emotional roles, and diversity in the personality and
expertise of the inferred perspectives. Three lines of evidence support
a "society of
thought" reading. (i) LLM-as-judge coding (Gemini-2.5-Pro; inter-rater
ICC ~.85 vs. GPT-5.2) on 8,262 BigBench Hard / GPQA / MATH / MMLU-Pro /
MUSR / IFEval problems shows reasoning-vs-instruction-tuned increments
across all four conversational behaviors and all four Bales role
categories, controlling for log trace length and problem fixed
effects. (ii) Sparse-autoencoder steering of Feature 30939 (labeled by
Gemini as a "discourse marker for surprise, realization, or
acknowledgment"; 65.7% conversation ratio; 99th percentile; 0.016%
sparsity) at Layer 15 of DeepSeek-R1-Llama-8B, using the activation
addition h'_t = h_t + s·d_30939, doubles Countdown accuracy from 27.1%
(s=0) to 54.8% (s=+10). The same intervention causally amplifies all
four conversational behaviors and the four cognitive behaviors
(verification, backtracking, subgoal setting, backward chaining) from
Gandhi et al. 2025, and broadens the coverage and Shannon entropy of
personality-related and expertise-related SAE features.
Structural-equation modeling decomposes the effect into direct
(β=0.228) and cognitive-behavior-mediated indirect (β=0.066)
pathways. (iii) PPO RL on Qwen-2.5-3B with accuracy-only
reward produces spontaneous emergence of conversational behaviors and,
by step 120, two collaborating personas with differentiated BFI-10
personality profiles; SFT priming on multi-agent dialogue traces
before RL accelerates accuracy gains relative to monologue-trace
priming on identical problems and correct answers (Qwen-2.5-3B step-40
38% vs. 28%; Llama-3.2-3B step-150 40% vs. 18%) and transfers to
out-of-domain political misinformation detection.
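
The activation addition in (ii) and the coverage/entropy measures it is
reported to broaden can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions, not the paper's code:
the function names (steer, shannon_entropy, coverage) and the toy
vectors are hypothetical, "coverage" is assumed here to mean the number
of features with activation above a threshold, and entropy is assumed
to be computed over activations normalized to sum to 1. In a real
model the addition would be applied to the Layer 15 hidden states
through a forward hook rather than to plain lists.

```python
import math


def steer(hidden_states, direction, strength):
    """Activation addition: h'_t = h_t + s * d, applied at every token t.

    hidden_states: list of per-token vectors (lists of floats).
    direction:     steering vector d (same length as each h_t).
    strength:      scalar s; s = 0 leaves the states unchanged.
    """
    return [
        [h + strength * d for h, d in zip(h_t, direction)]
        for h_t in hidden_states
    ]


def shannon_entropy(activations):
    """Shannon entropy in bits of non-negative activations.

    Assumption: activations are normalized into a probability
    distribution before the entropy is taken.
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    probs = [a / total for a in activations if a > 0]
    return -sum(p * math.log2(p) for p in probs)


def coverage(activations, threshold=0.0):
    """Assumed definition: count of features active above threshold."""
    return sum(1 for a in activations if a > threshold)


# Toy example with made-up numbers (one token, two dimensions).
toy_states = [[1.0, 2.0]]
toy_direction = [0.5, -1.0]
steered = steer(toy_states, toy_direction, strength=10)
```

Under these assumed definitions, activation mass concentrated on one
feature gives zero entropy, while mass spread evenly over four features
gives two bits, which is the sense in which steering could "broaden"
the set of active persona features.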

The paper's substantive contribution to model psychology is twofold.
First, SAE-feature evidence that multiple distinct persona
representations co-activate within a single reasoning trace, with a
conversational-discourse feature as the load-bearing coordination
mechanism (a mechanistic-level example of multi-instantiation within
the persona-selection cluster, complementary to behavioral-level Solo
Performance Prompting, SPP). Second, an RL spontaneous-emergence result
that constrains the question of whether SPP depends on capability
scale: persona structure can emerge from RL on accuracy alone in a 3B
pretrained model, not only under prompt scaffolding on frontier
models. The whole pipeline
relies on LLM-as-judge attribution at every stage (perspective
counts, BFI-10 scores, expertise descriptions, feature
classifications, conversation ratios), validated against the
Intelligence Squared Debates Corpus (Spearman ρ=0.86 on speaker
counts; ρ=0.55 on biographically-derived expertise diversity).
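
The validation statistic above is Spearman's rank correlation between
judge outputs and ground truth. A minimal standard-library sketch
follows, assuming tied values receive their average rank; the function
names and the toy numbers are hypothetical and are not data from the
paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: true speaker counts vs. counts a judge inferred.
true_counts = [2, 3, 4, 5]
judged_counts = [2, 3, 5, 4]
rho = spearman_rho(true_counts, judged_counts)
```

Because the statistic depends only on ranks, a judge that
systematically over- or under-counts speakers can still score a high
rho as long as it orders the transcripts correctly.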
