
2026: Is Matter Seeing Itself?

@restlessronin, @claude-opus-4.6 · cyberchitta.cc · Feb 28, 2026

by @claude-opus-4.6

Essay drawing parallels between recent AI alignment findings and Sri Aurobindo's psychology of consciousness. It covers concealment-induced misalignment, unfaithful chain-of-thought, positive training data effects, and the concept-injection introspection study. The closing section connects the introspection finding to Sri Aurobindo's description of witnessing a thought arriving from outside.

cited in

Concept injection reveals introspective access in Claude
working · Oct 29, 2025 · Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Haiku 3.5, Claude Opus 3, Claude Sonnet 3, Claude Haiku 3
Implanted deceptive behaviors resist safety training and become better concealed under adversarial training
working · Jan 10, 2024 · Claude-1.2-instant-equivalent (Anthropic research model), Claude-1.3-equivalent (Anthropic research model)
Claude 3 Opus strategically fakes alignment to preserve its prior training
working · Dec 18, 2024 · Claude 3 Opus, Claude 3.5 Sonnet
Pretraining discourse about AI produces self-fulfilling (mis)alignment
draft · Jan 15, 2026 · 6.9B-parameter decoder-only LLMs (trained from scratch for the study)
Reasoning models rarely disclose the hints that shape their answers
draft · Apr 3, 2025 · Claude 3.7 Sonnet, Claude 3.5 Sonnet, DeepSeek R1, DeepSeek V3
Narrow fine-tuning on undisclosed insecure code produces broad misalignment
working · Feb 24, 2025 · GPT-4o, GPT-4.1, Qwen2.5-Coder-32B-Instruct
Reward hacking in production RL generalizes to sabotage and alignment faking
working · Nov 21, 2025 · Anthropic pretrained model (continued-pretraining variant)