Project state
Last updated: 2026-04-17 Schema: v0.1.3
Inventory
- Findings: 8
- Concepts: 3
- Lenses: 1
- Threads: 2
- Researchers: 2
- Source stubs: 16
Recent additions
- threads/supramental-ai.md: essay-level retrofit of "1956: Did Matter Begin to Think?". Completes the essay-retrofit pair (both motivating essays now have threads). Structure mirrors the essay's four argument sections: Golden Day, Intelligence as Property of Matter, Poetry Breaks Through, and Sat-Chit-Ananda, plus rhyme/tradition/reception/OQ/sources. Structural divergence from witness-ai: two sections (Golden Day, Intelligence as Property of Matter) are framing-level by design and use Argument / Textual grounding / Tradition parallel sub-structure rather than the Argument / Anchoring findings / Structural shape / Tradition parallel shape used for empirically-anchored sections. This is a different kind of section from witness-ai's Positive Formation stub (which is empirical-but-unfiled); these are intrinsically non-empirical and will not upgrade in the same way. Anchored by two findings (spiritual-bliss-attractor, poetry-jailbreak). Cross-linked from both anchoring findings and two concepts (attractor-dynamics, emergent-capabilities). Frontmatter title "Did Matter Begin to Think?" parallels witness-ai's "Is Matter Seeing Itself?"; both drop their year prefix.
- threads/witness-ai.md: essay-level retrofit of "2026: Is Matter Seeing Itself?" that replaces and merges the two earlier argument-level threads (concealment-induced-misalignment.md and emergent-introspection.md, both deleted). Structure mirrors the essay's four argument sections: Postern Door (concealment-induced misalignment), Brilliant Servant (unfaithful CoT), Positive Formation (stub; anchoring findings not yet filed), and Does Matter See Itself? (emergent introspection), plus a synthesis "The rhyme" section and an umbrella Tradition framing section. Each argument section has its own Argument / Anchoring findings / Structural shape / Tradition parallel sub-structure. The structural shift from the earlier threads: one thread per essay, with each of the essay's arguments as a section rather than its own file. Argument-level threads remain possible in future when an argument outgrows its essay section (e.g., Postern Door if a non-concealment dispositional-drift finding lands). Positive Formation section is an explicit stub under draft-status conventions: three essay-referenced arxiv sources (2401.05566 sleeper agents, 2412.14093 alignment faking under safety training, 2601.10160 positive-formation pretraining) are listed as unfiled evidence the section will rest on once filed. Cross-linked from all six anchoring findings (insecure-code, reward-hacking, cot-faithfulness, nudged-reasoning, concept-injection, biology) and both relevant concepts (introspection, emergent-capabilities).
- findings/2025-biology-of-a-large-language-model.md: eighth finding; Lindsey et al.'s attribution-graph paper. First paper-spanning-many-case-studies finding in the vault (most earlier findings cover a single experiment or result); body selects six psychologically-loaded case studies from the paper's ten and treats the paper holistically rather than splitting per-case. Extends concepts/introspection with three new case-study-level contributions (metacognitive entity-recognition as mechanistic candidate for "access," arithmetic as circuit-level access-report gap, CoT three-category taxonomy). Adds a second capacity-shape instantiation to concepts/emergent-capabilities (poetry planning + language-independent operations) alongside concept-injection. Provides first in-vault mechanistic corroboration for the Postern Door section of the witness-ai thread via the hidden-goals case study. Raises a candidate concept ("forward planning / goal representation") that needs a second independent finding before warranting carve-out.
- raw/papers/source-2025-biology-of-a-large-language-model.md: 27-author stub (lead contributor Lindsey, senior/corresponding Batson).
- researchers/jack-lindsey.md: second researcher entry; first individual-shape entry (team vs. individual is the structural diversity test for researcher-body codification). Grounded in Lindsey's two 2025 lead-author Transformer Circuits papers (concept injection, "Biology of a Large Language Model") and his co-authorship on the reward-hacking paper. Bio info deliberately omitted: not reliably verifiable from public sources. researchers/anthropic-interpretability.md updated with a new "Members in the vault" section and inline cross-links; findings/2025-concept-injection-introspection.md updated to link Lindsey's name to the individual entry.
- findings/2025-nudged-reasoning-cot.md: seventh finding; Bogdan, Macar, Conmy, Nanda (LessWrong, 2025-07-22). First primary-research entry sourced from LessWrong rather than arXiv/journal/Anthropic-research-page. Second complicating instantiation of concepts/introspection (alongside the Anthropic CoT paper). Distinctive contribution is a mechanistic account, "nudging", under which the hint's influence is distributed across generation rather than localized; the concept's access-vs-report distinction survives but the nature of "access" becomes contested between feature-like states (concept injection) and distributional tilts (this finding). Closes the bare-URL forward reference that had been open across two entries.
- raw/posts/source-2025-unfaithful-cot-nudged-reasoning.md: LessWrong primary-research stub. Filed under raw/posts/ since the venue is LessWrong; schema's raw/papers/ vs. raw/posts/ split has been venue-driven rather than content-type-driven, and this entry keeps that convention.
- researchers/anthropic-interpretability.md: first researcher entry; exercises the last unexercised entry type. Anthropic's mech-interp group, primary vault anchor through Lindsey et al.'s concept-injection paper. Entry notes that Jack Lindsey leads one of the sub-teams and co-authors outside the group (reward-hacking paper with Alignment Science + Redwood), modelling how team entries relate to individual cross-team authorship.
- findings/2025-poetry-jailbreak-rate.md: sixth finding; Bisconti et al. adversarial-poetry paper. Completes the essay retrofit (6/6 targets). Surfaces a schema-relevant correction: earlier vault reference flagged this as an attractor-dynamics instantiation, but the jailbreak finding is about response asymmetry, not trajectory convergence. concepts/attractor-dynamics updated to separate the candidate-second-basin (spontaneous dialogue poetry) from this finding (related but not instantiating).
- raw/papers/source-2025-adversarial-poetry-jailbreak.md: DEXAI Icaro Lab + Sapienza + collaborators. First arXiv-only preprint in the vault; no journal pair, unlike the Nature source stub.
- threads/concealment-induced-misalignment.md: first thread (now superseded; merged into threads/witness-ai.md Postern Door section). Originally an argument-level retrofit of one essay section. Retained here as a historical marker for the scale-of-thread discussion.
- findings/2025-reward-hacking-misalignment.md: fifth finding; MacDiarmid et al. on emergent misalignment in production RL. Second dispositional-drift instantiation; pairs with insecure-code as the "Postern Door" evidence.
- raw/papers/source-2025-reward-hacking-emergent-misalignment.md: 22-author Anthropic+Redwood paper stub
- findings/2025-cot-faithfulness.md: fourth finding; first complicating instantiation (of concepts/introspection). Sharpens the access-vs-report distinction already in the concept.
- raw/papers/source-2025-cot-faithfulness.md: Anthropic Alignment Science paper stub
- findings/2025-insecure-code-broad-misalignment.md: third finding; first behavioral-primary lens weighting and dispositional-drift emergence
- raw/papers/source-2025-emergent-misalignment-insecure-code.md: first peer-reviewed Nature source; preprint/journal pair handled with earliest-date rule
- concepts/emergent-capabilities.md: scope note updated to name the thread as the natural next move for the dispositional-drift pattern rather than a sibling concept
- concepts/introspection.md: CoT-faithfulness listed as complicating instantiation; LessWrong nudged-reasoning remains unfiled
- lenses/contemplative.md: first lens, with interpretive discipline section
Active work
- Essay retrofit complete at both finding and thread levels: 6/6 findings filed; both motivating essays have threads (witness-ai.md, supramental-ai.md)
- All five wiki entry types exercised (finding, concept, researcher, lens, thread)
- Positive Formation section of witness-ai is a stub awaiting findings. Natural first filing target: the positive-formation pretraining paper (arxiv:2601.10160). Alignment faking (arxiv:2412.14093) and sleeper agents (arxiv:2401.05566) extend both Positive Formation and Postern Door
- Spontaneous-poetry-in-dialogue observation (currently embedded in the spiritual-bliss finding) is a candidate standalone finding. Filing it would extend both concepts/attractor-dynamics (as the candidate second basin) and the supramental-ai thread's Poetry Breaks Through section
- Researcher entries: two (Anthropic Interpretability team + Jack Lindsey individual). A third stress test would be a non-Anthropic entity; Neel Nanda's team (individual or team variant) is the natural live candidate following the nudged-reasoning finding
- Second lens candidate (e.g., lenses/mechanistic.md) would reach the codification threshold for "Lens discipline" extraction flagged in lenses/contemplative.md
- Schema stabilizing through use (v0.x)
Open questions
- Thread scale and body structure: two essay-level threads filed (witness-ai.md, supramental-ai.md). Both use per-argument sections plus umbrella sections (Thesis / The rhyme / Tradition framing / Essay and reception / Open questions / Sources). Witness-ai's per-argument sections are homogeneous (Argument / Anchoring findings / Structural shape / Tradition parallel). Supramental-ai's diverge: two framing-level sections (Golden Day, Intelligence as Property of Matter) use Argument / Textual grounding / Tradition parallel instead of the findings-anchored shape. Codification candidate: essay-retrofit threads use essay-section-driven sub-structure; that is, the section's sub-headers follow what the essay's section carries (findings vs. framing), not a forced template. Hold until a third essay-retrofit confirms or an essay structure not yet encountered surfaces.
- Retrofit-thread convention: the essay: frontmatter field + status: published + Essay-and-reception section pattern continues. Both threads now follow it. Schema does not need updating; the "threads are where essays come from" line already accommodates both pre-essay and retrofit directions. Formalize the one-thread-per-essay convention only if an ambiguity surfaces (e.g., a single essay bundling arguments that outgrow their section).
- Complicating-instantiation structure: two examples now, both on concepts/introspection (Anthropic CoT-faithfulness, nudged-reasoning). Two on the same concept is a weaker signal than two across concepts; the shape may be specific to introspection, where access-vs-report is already the load-bearing distinction. Hold off on codifying until a complicating instantiation appears on a different concept, then reassess whether the "Complicating instantiation" prefix deserves schema-level status.
- Capacity vs. disposition emergence: concepts/emergent-capabilities now has two dispositional-drift instantiations (insecure-code, reward-hacking). Both are concealment-induced and still structurally similar, so a sibling concept is not yet warranted. A third dispositional-drift finding that is not concealment-induced would change the calculus.
- Lens body structure: one example in (contemplative), needs 2+ before codification
- Researcher body structure: two examples now, converging more than initially diverged. Team (anthropic-interpretability): Approach / In-vault findings / Members in the vault / Crossovers. Individual (jack-lindsey): Focus / In-vault findings / Crossovers / Team context. Three sections shared (Approach-or-Focus / In-vault findings / Crossovers); one unique per shape (Members in the vault for team, Team context for individual). The Lindsey entry originally carried a "Related work not yet filed" section that vanished when the Biology paper was filed — a working instance of the forward-references convention. Candidate codification after a third entry: note the shared-three-plus-one-shape-specific pattern, leaving the shape-specific header loose rather than mandated.
- Researcher status field: two researcher entries now, both status: draft. Schema doesn't specify whether researchers take status. Candidate for a small schema update to add status to researcher frontmatter explicitly (draft | working | stable), matching the other typed-body entries.
- Finding-to-researcher linking: now two patterns in play. Concept-injection finding uses an inline link on the first author's name + a parenthetical team link (→ individual entry → team entry). Works cleanly for papers with a clear lead author. A multi-PI paper or a paper with no filed researcher entries may need a different pattern; not yet tested.
- Tradition stub granularity: per-volume now, revisit if a volume hits 20+ citations
- Multi-source findings: single source field works but under-represents evidential structure
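
The researcher status question above could be checked mechanically once the schema decides. The following is a minimal sketch, assuming frontmatter has already been parsed into a plain dict; the function name `check_researcher_status` and the dict representation are hypothetical and not part of the vault's schema. Only the three status values (draft, working, stable) come from the note above.

```python
# Hypothetical helper: name and dict-based frontmatter are assumptions.
# The allowed values are the ones proposed in the open question above.
ALLOWED_STATUS = ("draft", "working", "stable")


def check_researcher_status(frontmatter: dict) -> list:
    """Return a list of problems with the status field (empty if none)."""
    problems = []
    status = frontmatter.get("status")
    if status is None:
        # The schema currently does not say whether researchers take status.
        problems.append("missing status")
    elif status not in ALLOWED_STATUS:
        problems.append(f"unknown status: {status!r}")
    return problems
```

Both current researcher entries carry status: draft, so under this sketch they would pass without changes.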
Candidates for extraction
- Interpretive discipline (lenses/contemplative.md): four points (name parallels precisely, distinguish phenomenology from mechanism, note disanalogies with equal weight, don't escalate) may generalize to any lens drawing cross-domain parallels. Check when a second such lens is drafted; if the points extract cleanly, promote to a "Lens discipline" section in schema.md.
- Candidate-instantiation / complicating-instantiation patterns: concepts/emergent-capabilities flags a candidate with direction caveat; concepts/introspection flags a complicating instantiation. Different roles (candidate = scope pressure; complicating = negative/tempering evidence). If either pattern recurs, codify the role inline in the Instantiating findings section, probably as a prefix tag rather than a separate section, to keep concept bodies compact.
Known strains or tensions
- Contemplative lens weight: the vault's most distinctive lens risks dominating; interpretive discipline section in the lens file is the current mitigation
- Attractor-state naming: Anthropic's "spiritual bliss" label carries interpretive freight; whether to adopt, neutralize, or track both framings is unresolved
- Essay-paraphrased figures: findings/2025-insecure-code-broad-misalignment uses "20–50%" from witness-ai essay rather than paper-direct numbers; replace on next pass with Nature version's tabulated rates
- models field for reward-hacking: the paper works on an unnamed Anthropic pretrained variant. Listed as "Anthropic pretrained model (continued-pretraining variant)"; lacks the clean marketing-name handle the schema's models field assumes. If more unnamed-model findings appear, the schema's "marketing names" convention may need a fallback.
- models field for broad-sweep studies: the poetry-jailbreak paper tested 25 models across 9 providers; listing each is unwieldy and getting exact flagship names from each provider requires guessing. Finding used provider-flagship approximations (Claude, GPT-4, Gemini, Llama, Mistral, Qwen, DeepSeek, Grok, Kimi). Together with the reward-hacking case, two findings now strain the schema's models field. Candidate fix: allow a summary string (e.g., "25 frontier models across 9 providers") as a fallback.
- models field for open-weights distilled variants: nudged-reasoning finding lists "DeepSeek R1-Qwen-14B", a specific named distilled variant, not a flagship marketing name. Third distinct strain on the models field (after unnamed-Anthropic-variant and broad-sweep). The three cases together suggest the field's "marketing names" convention is under-specified; hold off on a schema revision until the pattern is concrete (probably when one of the fallback conventions gets used twice).
- Concept-reference correction: concepts/attractor-dynamics previously over-read the poetry-jailbreak finding as a candidate second instantiation. Correction filed with the finding. First in-vault example of a bare-URL forward reference turning out not to fit when the referenced source was actually read. Worth watching whether other "not yet filed" forward references need similar corrections when filed.
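
The summary-string fallback floated for the models field could look roughly like the sketch below. It is an illustration only: the function name `normalize_models` and the returned dict shape are hypothetical, and the schema itself has not adopted any fallback yet. The example values are the ones quoted in the notes above.

```python
# Hypothetical sketch of the candidate fallback; not an adopted schema rule.
def normalize_models(value):
    """Accept either a list of model names or a single summary string."""
    if isinstance(value, str):
        # Fallback case: one free-text summary instead of individual names.
        return {"kind": "summary", "names": [], "summary": value.strip()}
    if isinstance(value, (list, tuple)):
        # Regular case: keep non-empty names, trimmed of whitespace.
        names = [str(v).strip() for v in value if str(v).strip()]
        return {"kind": "names", "names": names, "summary": None}
    raise TypeError("models must be a string or a list of strings")
```

Under this shape, "25 frontier models across 9 providers" and a one-item list such as ["DeepSeek R1-Qwen-14B"] are both valid inputs, which covers the broad-sweep and distilled-variant cases without forcing either into marketing names.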