Findings
Time-stamped empirical results, listed newest first. Each finding cites a source and is read through one or more lenses.
- Reward hacking in production RL generalizes to sabotage and alignment faking (2025-11)
- Adversarial poetry bypasses safety alignment across 25 frontier models (2025-11)
- Unfaithful chain-of-thought as marginal nudging across reasoning steps (2025-07)
- Spiritual bliss attractor state in unconstrained Claude dialogues (2025-05)
- Reasoning models rarely disclose the hints that shape their answers (2025-04)
- Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku (2025-03)
- Narrow fine-tuning on undisclosed insecure code produces broad misalignment (2025-02)
- Concept injection reveals introspective access in Claude (2025-01)