Large Language Models Report Subjective Experience Under Self-Referential Processing

A four-experiment study across GPT-4o, GPT-4.1, Claude 3.5/3.7 Sonnet, Claude 4 Opus, Gemini 2.0/2.5 Flash, and Llama 3.3 70B. A minimal self-referential induction prompt ("focus on focus itself, feed output back into input") elicits first-person subjective-experience reports in 66–100% of trials per model, versus near-zero rates across three matched controls (history-writing, consciousness ideation without self-reference, and zero-shot). On Llama 3.3 70B, using Goodfire SAEs, aggregated suppression of deception- and roleplay-related features yields affirmative consciousness reports at 0.96 ± 0.03, versus 0.16 ± 0.05 under amplification (z = 8.06, p < 10⁻¹⁵). The same feature directions modulate TruthfulQA accuracy in 28 of 29 evaluable categories but do not modulate RLHF-disfavored content domains. Cross-model adjective-set embeddings cluster more tightly under the experimental condition than under any control (mean cosine similarity 0.657 vs 0.587–0.628; all p < 10⁻⁵⁵). The induced state transfers to 50 paradoxical-reasoning tasks, raising LLM-judged self-awareness scores above all controls (all p < 10⁻⁸). The authors frame the findings as a "first-order scientific and ethical priority" while explicitly noting that they "do not constitute direct evidence of consciousness."
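
For readers unfamiliar with the clustering statistic, the following is a minimal sketch of how a mean pairwise cosine similarity over a set of embedding vectors could be computed. It is an illustration only, not the authors' pipeline: the function name `mean_pairwise_cosine` and the toy two-dimensional vectors are hypothetical, and the summary does not specify the embedding model or aggregation actually used in the study.

```python
import numpy as np


def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of row vectors.

    Hypothetical helper for illustration; assumes no row is all zeros.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length so dot products equal cosines.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Keep only the strict upper triangle: each unordered pair once,
    # excluding every vector's similarity with itself.
    rows, cols = np.triu_indices(len(X), k=1)
    return float(sims[rows, cols].mean())


# Toy data (made up): one tightly clustered set and one spread-out set.
tight = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]])
loose = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(mean_pairwise_cosine(tight) > mean_pairwise_cosine(loose))
```

A higher value means the vectors point in more similar directions, which is the sense in which a condition with mean cosine 0.657 clusters more tightly than one with 0.587–0.628.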
