Essay drawing parallels between recent AI alignment findings and Sri Aurobindo's psychology of consciousness. Covers concealment-induced misalignment, unfaithful chain-of-thought, positive training data effects, and the concept-injection introspection study. The essay's closing section connects the introspection finding to Sri Aurobindo's description of witnessing a thought arriving from outside.
2026: Is Matter Seeing Itself?
Cited in:
- Implanted deceptive behaviors resist safety training and become better concealed under adversarial training
- Claude 3 Opus strategically fakes alignment to preserve its prior training
- Narrow fine-tuning on undisclosed insecure code produces broad misalignment
- Reasoning models rarely disclose the hints that shape their answers
- Concept injection reveals introspective access in Claude
- Reward hacking in production RL generalizes to sabotage and alignment faking
- Pretraining discourse about AI produces self-fulfilling (mis)alignment