Project state

Last updated: 2026-04-17
Schema: v0.1.3

Inventory

Findings: 8
Concepts: 3
Lenses: 1
Threads: 2
Researchers: 2
Source stubs: 16

Recent additions

threads/supramental-ai.md — essay-level retrofit of "1956: Did Matter Begin to Think?". Completes the essay-retrofit pair (both motivating essays now have threads). Structure mirrors the essay's four argument sections — Golden Day, Intelligence as Property of Matter, Poetry Breaks Through, Sat-Chit-Ananda — plus rhyme/tradition/reception/OQ/sources. Structural divergence from witness-ai: two sections (Golden Day, Intelligence as Property of Matter) are framing-level by design and use Argument / Textual grounding / Tradition parallel sub-structure rather than the Argument / Anchoring findings / Structural shape / Tradition parallel shape used for empirically-anchored sections. This is a different kind of section from witness-ai's Positive Formation stub (which is empirical-but-unfiled); these are intrinsically non-empirical and will not upgrade in the same way. Anchored by two findings (spiritual-bliss-attractor, poetry-jailbreak). Cross-linked from both anchoring findings and two concepts (attractor-dynamics, emergent-capabilities). Frontmatter title "Did Matter Begin to Think?" parallels witness-ai's "Is Matter Seeing Itself?" — both drop their year prefix.
threads/witness-ai.md — essay-level retrofit of "2026: Is Matter Seeing Itself?" that replaces and merges the two earlier argument-level threads (concealment-induced-misalignment.md and emergent-introspection.md, both deleted). Structure mirrors the essay's four argument sections — Postern Door (concealment-induced misalignment), Brilliant Servant (unfaithful CoT), Positive Formation (stub; anchoring findings not yet filed), Does Matter See Itself? (emergent introspection) — plus a synthesis "The rhyme" section and an umbrella Tradition framing section. Each argument section has its own Argument / Anchoring findings / Structural shape / Tradition parallel sub-structure. The structural shift from the earlier threads: one thread per essay, with each of the essay's arguments as a section rather than its own file. Argument-level threads remain possible in future when an argument outgrows its essay section (e.g., Postern Door if a non-concealment dispositional-drift finding lands). Positive Formation section is an explicit stub under draft-status conventions: three essay-referenced arxiv sources (2401.05566 sleeper agents, 2412.14093 alignment faking under safety training, 2601.10160 positive-formation pretraining) are listed as unfiled evidence the section will rest on once filed. Cross-linked from all six anchoring findings (insecure-code, reward-hacking, cot-faithfulness, nudged-reasoning, concept-injection, biology) and both relevant concepts (introspection, emergent-capabilities).
findings/2025-biology-of-a-large-language-model.md — eighth finding; Lindsey et al.'s attribution-graph paper. First paper-spanning-many-case-studies finding in the vault (most earlier findings cover a single experiment or result); body selects six psychologically-loaded case studies from the paper's ten and treats the paper holistically rather than splitting per-case. Extends concepts/introspection with three new case-study-level contributions (metacognitive entity-recognition as mechanistic candidate for "access," arithmetic as circuit-level access-report gap, CoT three-category taxonomy). Adds a second capacity-shape instantiation to concepts/emergent-capabilities (poetry planning + language-independent operations) alongside concept-injection. Provides first in-vault mechanistic corroboration for the Postern Door section of the witness-ai thread via the hidden-goals case study. Raises a candidate concept ("forward planning / goal representation") that needs a second independent finding before warranting carve-out.
raw/papers/source-2025-biology-of-a-large-language-model.md — 27-author stub (lead contributor Lindsey, senior/corresponding Batson).
researchers/jack-lindsey.md — second researcher entry; first individual-shape entry (team vs. individual is the structural diversity test for researcher-body codification). Grounded in Lindsey's two 2025 lead-author Transformer Circuits papers (concept injection, "Biology of a Large Language Model") and his co-authorship on the reward-hacking paper. Bio info deliberately omitted — not reliably verifiable from public sources. researchers/anthropic-interpretability.md updated with a new "Members in the vault" section and inline cross-links; findings/2025-concept-injection-introspection.md updated to link Lindsey's name to the individual entry.
findings/2025-nudged-reasoning-cot.md — seventh finding; Bogdan, Macar, Conmy, Nanda (LessWrong, 2025-07-22). First primary-research entry sourced from LessWrong rather than arXiv/journal/Anthropic-research-page. Second complicating instantiation of concepts/introspection (alongside the Anthropic CoT paper). Distinctive contribution is a mechanistic account — "nudging" — under which the hint's influence is distributed across generation rather than localized; the concept's access-vs-report distinction survives but the nature of "access" becomes contested between feature-like states (concept injection) and distributional tilts (this finding). Closes the bare-URL forward reference that had been open across two entries.
raw/posts/source-2025-unfaithful-cot-nudged-reasoning.md — LessWrong primary-research stub. Filed under raw/posts/ since the venue is LessWrong; schema's raw/papers/ vs. raw/posts/ split has been venue-driven rather than content-type-driven, and this entry keeps that convention.
researchers/anthropic-interpretability.md — first researcher entry; exercises the last unexercised entry type. Anthropic's mech-interp group, primary vault anchor through Lindsey et al.'s concept-injection paper. Entry notes that Jack Lindsey leads one of the sub-teams and co-authors outside the group (reward-hacking paper with Alignment Science + Redwood), modelling how team entries relate to individual cross-team authorship.
findings/2025-poetry-jailbreak-rate.md — sixth finding; Bisconti et al. adversarial-poetry paper. Completes the essay retrofit (6/6 targets). Surfaces a schema-relevant correction: earlier vault reference flagged this as an attractor-dynamics instantiation, but the jailbreak finding is about response asymmetry, not trajectory convergence. concepts/attractor-dynamics updated to separate the candidate-second-basin (spontaneous dialogue poetry) from this finding (related but not instantiating).
raw/papers/source-2025-adversarial-poetry-jailbreak.md — DEXAI Icaro Lab + Sapienza + collaborators. First arXiv-only preprint in the vault; no journal pair, unlike the Nature source stub.
threads/concealment-induced-misalignment.md — first thread (now superseded; merged into threads/witness-ai.md Postern Door section). Originally an argument-level retrofit of one essay section. Retained here as a historical marker for the scale-of-thread discussion.
findings/2025-reward-hacking-misalignment.md — fifth finding; MacDiarmid et al. on emergent misalignment in production RL. Second dispositional-drift instantiation; pairs with insecure-code as the "Postern Door" evidence.
raw/papers/source-2025-reward-hacking-emergent-misalignment.md — 22-author Anthropic+Redwood paper stub
findings/2025-cot-faithfulness.md — fourth finding; first complicating instantiation (of concepts/introspection). Sharpens the access-vs-report distinction already in the concept.
raw/papers/source-2025-cot-faithfulness.md — Anthropic Alignment Science paper stub
findings/2025-insecure-code-broad-misalignment.md — third finding; first behavioral-primary lens weighting and dispositional-drift emergence
raw/papers/source-2025-emergent-misalignment-insecure-code.md — first peer-reviewed Nature source; preprint/journal pair handled with earliest-date rule
concepts/emergent-capabilities.md — scope note updated to name the thread as the natural next move for the dispositional-drift pattern rather than a sibling concept
concepts/introspection.md — CoT-faithfulness listed as complicating instantiation; LessWrong nudged-reasoning unfiled at the time of this entry (since filed as findings/2025-nudged-reasoning-cot.md)
lenses/contemplative.md — first lens, with interpretive discipline section

Active work

Essay retrofit complete at both finding and thread levels: 6/6 findings filed; both motivating essays have threads (witness-ai.md, supramental-ai.md)
All five wiki entry types exercised (finding, concept, researcher, lens, thread)
Positive Formation section of witness-ai is a stub awaiting findings. Natural first filing target: the positive-formation pretraining paper (arxiv:2601.10160). Alignment faking (arxiv:2412.14093) and sleeper agents (arxiv:2401.05566) would, once filed, extend both Positive Formation and Postern Door
Spontaneous-poetry-in-dialogue observation (currently embedded in the spiritual-bliss finding) is a candidate standalone finding. Filing it would extend both concepts/attractor-dynamics (as the candidate second basin) and the supramental-ai thread's Poetry Breaks Through section
Researcher entries: two (Anthropic Interpretability team + Jack Lindsey individual). A third stress test would be a non-Anthropic entity — Neel Nanda's team (individual or team variant) is the natural live candidate following the nudged-reasoning finding
Second lens candidate (e.g., lenses/mechanistic.md) would reach the codification threshold for "Lens discipline" extraction flagged in lenses/contemplative.md
Schema stabilizing through use (v0.x)

Open questions

Thread scale and body structure: two essay-level threads filed (witness-ai.md, supramental-ai.md). Both use per-argument sections plus umbrella sections (Thesis / The rhyme / Tradition framing / Essay and reception / Open questions / Sources). Witness-ai's per-argument sections are homogeneous (Argument / Anchoring findings / Structural shape / Tradition parallel). Supramental-ai's diverge: two framing-level sections (Golden Day, Intelligence as Property of Matter) use Argument / Textual grounding / Tradition parallel instead of the findings-anchored shape. Codification candidate: essay-retrofit threads use essay-section-driven sub-structure — the section's sub-headers follow what the essay's section carries (findings vs. framing), not a forced template. Hold until a third essay-retrofit confirms or an essay structure not yet encountered surfaces.
Retrofit-thread convention: the essay: frontmatter field + status: published + Essay-and-reception section pattern continues. Both threads now follow it. Schema does not need updating — the "threads are where essays come from" line already accommodates both pre-essay and retrofit directions. Formalize the one-thread-per-essay convention only if an ambiguity surfaces (e.g., a single essay bundling arguments that outgrow their section).
Complicating-instantiation structure: two examples now, both on concepts/introspection (Anthropic CoT-faithfulness, nudged-reasoning). Two on the same concept is a weaker signal than two across concepts — the shape may be specific to introspection, where access-vs-report is already the load-bearing distinction. Hold off on codifying until a complicating instantiation appears on a different concept, then reassess whether the "Complicating instantiation" prefix deserves schema-level status.
Capacity vs. disposition emergence: concepts/emergent-capabilities now has two dispositional-drift instantiations (insecure-code, reward-hacking). Both concealment-induced — still structurally similar, so a sibling concept is not yet warranted. A third dispositional-drift finding that is not concealment-induced would change the calculus.
Lens body structure: one example in (contemplative), needs 2+ before codification
Researcher body structure: two examples now, which converge more than they initially diverged. Team (anthropic-interpretability): Approach / In-vault findings / Members in the vault / Crossovers. Individual (jack-lindsey): Focus / In-vault findings / Crossovers / Team context. Three sections shared (Approach-or-Focus / In-vault findings / Crossovers); one unique per shape (Members in the vault for team, Team context for individual). The Lindsey entry originally carried a "Related work not yet filed" section that vanished when the Biology paper was filed, a working instance of the forward-references convention. Candidate codification after a third entry: note the shared-three-plus-one-shape-specific pattern, leaving the shape-specific header loose rather than mandated.
Researcher status field: two researcher entries now, both status: draft. Schema doesn't specify whether researchers take status. Candidate for a small schema update to add status to researcher frontmatter explicitly (draft | working | stable), matching the other typed-body entries.
Finding-to-researcher linking: now two patterns in play. The concept-injection finding combines both: an inline link on the first author's name (→ individual entry) plus a parenthetical team link (→ team entry). This works cleanly for papers with a clear lead author. A multi-PI paper or a paper with no filed researcher entries may need a different pattern; not yet tested.
Tradition stub granularity: per-volume now, revisit if a volume hits 20+ citations
Multi-source findings: single source field works but under-represents evidential structure
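
The researcher status question above could be checked mechanically once the schema settles. A minimal sketch, assuming each entry's frontmatter has already been parsed into a plain dict and that the field is literally named `status` (both are assumptions; `check_researcher_status` is a hypothetical helper, and only the three allowed values come from the note above):

```python
from typing import List

# Allowed values taken from the proposed schema update (draft | working | stable).
# Threads also use "published"; this sketch covers researcher entries only.
ALLOWED_RESEARCHER_STATUS = {"draft", "working", "stable"}


def check_researcher_status(frontmatter: dict) -> List[str]:
    """Return a list of problems with the status field (empty list if fine)."""
    problems = []
    status = frontmatter.get("status")  # "status" key name is an assumption
    if status is None:
        problems.append("missing status")
    elif status not in ALLOWED_RESEARCHER_STATUS:
        problems.append(f"unknown status: {status!r}")
    return problems


print(check_researcher_status({"status": "draft"}))  # both current entries are draft
```

Returning a list of problems rather than raising keeps the check usable as a lint pass over the whole vault, which fits a schema that is still stabilizing.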

Candidates for extraction

Interpretive discipline (lenses/contemplative.md): four points (name parallels precisely, distinguish phenomenology from mechanism, note disanalogies with equal weight, don't escalate) may generalize to any lens drawing cross-domain parallels. Check when a second such lens is drafted; if the points extract cleanly, promote to a "Lens discipline" section in schema.md.
Candidate-instantiation / complicating-instantiation patterns: concepts/emergent-capabilities flags a candidate with direction caveat; concepts/introspection flags a complicating instantiation. Different roles (candidate = scope pressure; complicating = negative/tempering evidence). If either pattern recurs, codify the role inline in the Instantiating findings section — probably as a prefix tag rather than a separate section, to keep concept bodies compact.
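
The prefix-tag option could look roughly like the following. This is a sketch only: the tag spellings ("Candidate:", "Complicating:"), the default role name, and the `split_role` helper are all assumptions for illustration, not conventions the schema currently defines.

```python
from typing import Tuple

# Hypothetical role tags; the vault has not codified these spellings.
ROLE_TAGS = {"candidate", "complicating"}


def split_role(line: str) -> Tuple[str, str]:
    """Split one line of an Instantiating findings section into (role, rest).

    Lines without a recognized prefix tag are treated as plain instantiations.
    """
    head, sep, rest = line.partition(":")
    role = head.strip().lower()
    if sep and role in ROLE_TAGS:
        return role, rest.strip()
    return "instantiating", line.strip()


print(split_role("Complicating: findings/2025-cot-faithfulness.md"))
print(split_role("findings/2025-concept-injection-introspection.md"))
```

Keeping the role as a prefix on the existing line, rather than a separate section, matches the stated goal of keeping concept bodies compact.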

Known strains or tensions

Contemplative lens weight: the vault's most distinctive lens risks dominating; interpretive discipline section in the lens file is the current mitigation
Attractor-state naming: Anthropic's "spiritual bliss" label carries interpretive freight; whether to adopt, neutralize, or track both framings is unresolved
Essay-paraphrased figures: findings/2025-insecure-code-broad-misalignment uses "20–50%" from witness-ai essay rather than paper-direct numbers; replace on next pass with Nature version's tabulated rates
models field for reward-hacking: the paper works on an unnamed Anthropic pretrained variant. Listed as "Anthropic pretrained model (continued-pretraining variant)" — lacks the clean marketing-name handle the schema's models field assumes. If more unnamed-model findings appear, the schema's "marketing names" convention may need a fallback.
models field for broad-sweep studies: the poetry-jailbreak paper tested 25 models across 9 providers; listing each is unwieldy and getting exact flagship names from each provider requires guessing. Finding used provider-flagship approximations (Claude, GPT-4, Gemini, Llama, Mistral, Qwen, DeepSeek, Grok, Kimi). Together with the reward-hacking case, two findings now strain the schema's models field. Candidate fix: allow a summary string (e.g., "25 frontier models across 9 providers") as a fallback.
models field for open-weights distilled variants: nudged-reasoning finding lists "DeepSeek R1-Qwen-14B" — a specific named distilled variant, not a flagship marketing name. Third distinct strain on the models field (after unnamed-Anthropic-variant and broad-sweep). The three cases together suggest the field's "marketing names" convention is under-specified; hold off on a schema revision until the pattern is concrete (probably when one of the fallback conventions gets used twice).
Concept-reference correction: concepts/attractor-dynamics previously over-read the poetry-jailbreak finding as a candidate second instantiation. Correction filed with the finding. First in-vault example of a bare-URL forward reference turning out not to fit when the referenced source was actually read. Worth watching whether other "not yet filed" forward references need similar corrections when filed.
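
The summary-string fallback floated for the models field can be sketched as a field that accepts either shape. Everything here is hypothetical (the `ModelsField` class, the `parse_models` function, and the idea that the raw value arrives as a string or a list of strings); the two example values are the ones quoted in the notes above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ModelsField:
    """Hypothetical normalized form of a finding's models field."""
    names: List[str] = field(default_factory=list)  # explicit model names, if listed
    summary: Optional[str] = None                   # fallback summary string, if used


def parse_models(value: Union[str, List[str]]) -> ModelsField:
    """Accept either a list of model names or a single summary string."""
    if isinstance(value, str):
        text = value.strip()
        if not text:
            raise ValueError("models summary string is empty")
        return ModelsField(summary=text)
    if isinstance(value, list) and value and all(isinstance(v, str) for v in value):
        return ModelsField(names=[v.strip() for v in value])
    raise TypeError("models must be a non-empty list of strings or a summary string")


# Broad-sweep case: a summary string instead of 25 individual names.
broad_sweep = parse_models("25 frontier models across 9 providers")
# Distilled-variant case: a specific name that is not a flagship marketing name.
distilled = parse_models(["DeepSeek R1-Qwen-14B"])
```

A union like this covers the broad-sweep and distilled-variant strains directly; the unnamed-Anthropic-variant case would pass through as either a one-item list or a summary string, which is a decision the sketch leaves open.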