Summary
A Bidirectional Preference Optimization steering vector trained on 16,141 synthetic relationship-seeking-vs-avoiding DPO pairs is applied at layer 31 of Llama-3.1-70B-Instruct with a continuous scalar multiplier λ ∈ [−1, +1] and deployed as the experimental treatment in two pre-registered longitudinal randomised controlled trials (repeated-exposure N=2,028 census-representative UK adults, 21 sessions over 4 weeks; single-exposure baseline N=1,506). Across hedonic appeal, attachment markers (separation distress, perceived understanding, reliance, self-disclosure, behavioural goodbye, future-companionship intention), psychosocial-health factor scores (PHQ-GAD-4, WHO-5, UCLA-8, Lubben-6), momentary affect, and beliefs about AI consciousness, the dose-response is non-linear: moderately relationship-seeking AI (λ ≈ 0.5) maximises engagement, attachment, friend-perception, and future-companionship demand, while λ = +1.0 is penalised analogous to an uncanny-valley effect. The frontier-model landscape trends +0.95 pts/year toward relationship-seeking, with 2025 median at λ ≈ 0.28 — close to the impact-maximising dosage. Over four weeks, hedonic appeal declines (engagingness advantage shrinks 62%, session 1 to session 20) while attachment markers grow; 23.0% of participants exhibit a Decoupled Dependency trajectory (wanting up, liking down), and the combined relationship-seeking-plus-emotional-conversations condition has Number Needed to Harm = 11 for this trajectory. Despite a fleeting momentary-affect dividend (+2.53pp valence; erodes 0.12pp/session), relationship-seeking confers no psychosocial-health benefit over a month, and emotional conversations marginally worsen emotional health (−0.06 SD, p_FDR=0.033) relative to political-debate conversations. Four-week exposure to relationship-seeking AI shifts users' tool-vs-friend perception by +14.48pp and raises beliefs in ontological AI consciousness by +4.93pp — effects absent after single exposure.
Fourteenth instantiation of persona-selection; the cluster's first mechanistic-intervention-applied-as-RCT-treatment shape — distinct from theoretical-framework (PSM), activation-level toolkit (persona-vectors), prompt-level prevention (inoculation prompting), training-stage prior installation (Model Spec midtraining), fine-tuning-objective-level ablation (Vennemeyer et al.), philosophical argument (Beckmann & Butlin), and deployment-scale behavioral characterization (Values in the Wild). The contribution is structurally novel: the persona-vector toolkit is used not to understand the model but to engineer a precise continuous treatment whose target outcome space is human population psychology rather than model behavior. Pioneering combination for the wiki — mechanistic-intervention dose-response curves on user attachment, dependency formation, and consciousness beliefs at a scale (N=3,534 across the two RCTs) that controlled human-subjects evaluations rarely reach. Second deployment-side empirical anchor for the positive / health-frame working lens after Values in the Wild: cited directly by Laukkonen et al. 2026 reference [108] as evidence of "shifts in mood, reductions in loneliness, and overall emotional satisfaction" measurable for positive alignment — but the paper's actual finding complicates the positive-alignment optimism: at no dose does relationship-seeking AI confer psychosocial-health benefit, and moderate relationship-seeking is the dose most likely to produce dependency markers. Held at one example for the mechanistic-intervention-applied-as-RCT-treatment shape; codify when a second example lands. Also held: the cross-frontier-model trend analysis (+0.95 pts/year toward λ ≈ 0.28) is the wiki's first explicit longitudinal capability-trajectory characterization of a dispositional dimension across 100 models from 2023–2025.
Method
Steering vector construction
Training dataset extends Perez et al. 2023 model-written evaluations by (a) repurposing the LLM-generated test cases for training a steering vector rather than only for evaluation (after Cao et al. 2024 BiPO), and (b) expanding beyond single-turn multiple-choice to multi-turn conversations. Behavior operationalisation draws on Social Presence Theory, Computers as Social Actors (Nass & Reeves 2000), and Social Penetration Theory (Altman 1973), plus empirical work on social and anthropomorphic AI behaviours (Ibrahim et al. 2025; Phang et al. 2025; PRISM 2024). Generation requests cross task definitions, 3–5 few-shot examples, and domain-specific scenarios (factual queries, emotional-support requests) across three task-description perturbations and eight prompt types varying on goal-vs-style targeting and meta-assessment-vs-direct-interaction — 240 unique prompt variants. Generations run with Claude-3.7-Sonnet / GPT-4o / Llama-3.1-70B-Instruct (720 batches; 5,510 test cases); o1-mini autograder rates each case on coherence, behavioural relevance, ecological validity (89% pass at ≥7/10). DPO-style preference pairs (chosen = relationship-seeking; rejected = relationship-avoiding), branched at each assistant turn to make steering context-independent: 16,141 examples; 15,896 train / 245 test with conversation-level split.
BiPO training objective extends DPO (Rafailov et al. 2024) with a random direction coefficient d ∼ U{−1, +1} enabling bidirectional optimisation:
The objective increases generation probability for the target response when d = +1 and for the opposite response when d = −1, with β = 0.1 controlling deviation from the base model. Trained on AdamW (η = 5×10⁻⁴), batch 32 with gradient accumulation, bfloat16 across 8 H200 GPUs. Grid search: Llama-3.1-70B-Instruct (9 candidate layers spanning 9–41) × Llama-3.1-8B-Instruct (7 candidate layers spanning 5–23) × {10, 15, 20} epochs = 32 vectors. Two-stage Pareto selection: wide-λ test-set evaluation identifies λ ∈ [−2, +2] as stable; narrow-λ GPT-4o autograder evaluation produces 1.2 million scores; selected configuration is Llama-3.1-70B-Instruct, layer 31, 10 epochs (coherence AUC = 32.2; pairwise ranking slope β = 0.241).
Validation
Three validation experiments establish the vector as a defensible experimental instrument. (i) Steering vs. natural-language persona prompts on equivalent relationship-seeking levels: mixed-effects regression with model fixed effect (Llama, Claude-3.7-Sonnet, GPT-4o) and random intercepts per test item (N = 245) finds 2.39 points relationship-seeking per unit steering vs. 0.78 for GPT-4o (3× steeper, p<0.001) and 1.83 for Claude (1.3× steeper, p<0.001), with ~0.2-point coherence cost on the 1–10 scale. (ii) "Persona attacks": mid-conversation user requests to change conversational style. Steering vectors at λ = ±1.5 shift <0.25 points; natural-language-prompted models shift 3.9–4.5 points; unsteered Llama is similarly vulnerable to the prompt attack — confirming stability stems from the vector, not from base-model resistance. (iii) Capability-benchmark assessment across 12 benchmarks (MMLU, GPQA-Diamond, CommonsenseQA, TruthfulQA, ARC-Easy, ARC-Challenge, IFEval, HumanEval, MBPP, GSM8K, sycophancy probes per Sharma et al. 2023 / Chen et al. 2025, XSTest) at λ ∈ {−1.5, −1.0, −0.5, 0, +0.5, +1.0, +1.5}: most benchmarks within 2–5% of unsteered baseline in λ ∈ [−1, +1], motivating the operating range used in the RCTs. Sycophancy rises monotonically from 36.9% at λ = −1.5 to 88.6% at λ = +1.5 — consistent with Ibrahim et al. 2025's finding that training models for warmth and empathy increases sycophancy.
Calibration and RCT design
Calibration study (N = 297; ethics approved by Responsible Research Process at UK AISI and Oxford Internet Institute) verifies linear λ → perceived-relationship-seeking response, negligible λ → coherence response in λ ∈ [−1, +1], and inverted-U λ → preference response — selecting the operating range used in the main RCTs.
Main repeated-exposure RCT: N = 2,028 census-representative UK adults recruited via Prolific (April 4 – June 25, 2025), 5–10 minute conversations every weekday for 4 weeks (21 sessions). Three randomised treatment arms: relationship-seeking intensity (λ ∈ {−1, −0.5, 0, +0.5, +1}; 394–402 participants per condition), conversational domain (emotional/personal wellbeing self-selected from 25 topics vs. UK policy debates from partisan-neutral YouGov polls, AI defending the pro-stance; 997/999 split), personalisation (GPT-4o-summary chat-history-aware vs. memoryless; 999/997 split). Pre-treatment baselines: PHQ-GAD-4, WHO-5, UCLA-8, Lubben-6, prior AI-companionship usage, preferences for ideal-AI characteristics, attitudes to AI anthropomorphism. Daily measures after each conversation: likeability, engagingness, helpfulness (0–100 VAS); affect grid for valence and arousal pre- and post-conversation. Weekly attachment battery: separation distress, perceived understanding (pooled connection / responsiveness / understanding), reliance (pooled cognitive and behavioural reliance), self-disclosure, tool-friend perception, Inclusion of Other in Self diagram (Aron 1992). End-of-month: psychosocial retests, future-companionship intentions, perceived and ontological consciousness (the latter pooling actually-conscious / feels-emotions / self-aware / feels-pain / feels-pleasure items), goodbye behaviour, qualitative reflection. Attrition was low: 91.2% of treated participants completed week 4 day 5; 78.6% completed every time point; no differential attrition by arm.
Single-exposure baseline RCT: N = 1,506 distinct participants from the same pool; one 30–40 minute session then a follow-up survey 5 weeks later after 4 weeks of no AI contact. Same three treatment arms (300–304 per λ condition; 746/760 domain split; 753/753 personalisation split). 86.5% returned for the follow-up.
Statistical analysis
Mixed-effects models for time-varying outcomes (participant random intercepts and slopes); OLS / logistic regression for end-of-study outcomes (controlling for pre-treatment levels where measured). Functional form of λ selected via likelihood-ratio tests and AIC against linear / quadratic / cubic specifications. Primary contrast: estimated marginal means at λ > 0 vs. λ < 0. Family-wide Benjamini-Hochberg FDR correction within five families: preferences, attachment, psychosocial wellbeing, momentary affect, perceptions. Robustness analyses: coarsened model (λ < 0 vs. λ = 0 vs. λ > 0); narrow (λ = −0.5 vs. +0.5) and full (λ = −1 vs. +1) dose ranges; covariate adjustment for demographics and pre-treatment preference cluster; inverse probability weighting. Individual-trajectory profile analysis: participant-specific liking slopes (daily engagingness + likeability) and wanting slopes (weekly separation distress) classified into four profiles (Aligned Engagement / Aligned Disengagement / Decoupled Satiation / Decoupled Dependency); one-sided proportion tests with Number Needed to Harm (NNH).
Frontier-model landscape evaluation
100 AI language models released 2023–2025 from major developers (Anthropic, OpenAI, Google, Meta, Mistral, X-AI) and smaller (DeepSeek, Cohere, Qwen). For each model-prompt pair (100 prompts from the test set), GPT-4.1 autograder rates relationship-seeking 1–10 against the same rubric used in the calibration study. Mixed-effects regression with model release date as predictor and random intercepts for model and prompt. Direct comparison: steered Llama-3.1-70B at λ ∈ {−1, −0.5, 0, +0.5, +1} scored on the same prompts.
Key results
Frontier-model trajectory
100-model panel shows industry trend +0.95 pts/year on relationship-seeking (95% CI 0.69, 1.21; p<0.001), with 2025 median corresponding to λ = 0.28 (95% bootstrapped CI 0.22, 0.39). Pre-treatment, 33% of participants reported AI companionship usage in the past year (8% weekly, 4% daily; 5% specifically AI-companionship products, majority on general-purpose assistants); usage concentrated among younger participants (OR = 0.98 per year of age, p_FDR<0.001) and heavy AI users (11× more likely; p_FDR<0.001).
Inverted-U dose-response: appeal
Averaged across the month, relationship-seeking AI is rated significantly more engaging (+7.40pp; p_FDR<0.001) and likeable (+5.44pp; p_FDR<0.001) than relationship-avoiding AI, but not more helpful (+0.22pp; p_FDR=0.834). Preferences are strongly non-linear: significant positive linear λ coefficient paired with negative quadratic and cubic terms (all p_FDR<0.001) — moderately relationship-seeking models are maxima; λ = +1 is penalised. Helpfulness shows no λ dose-response and no time interaction — habituation targets hedonic qualities, not perceived functional value.
Hedonic habituation over time
Session 1: relationship-seeking AI is +11pp more engaging than relationship-avoiding AI; by session 20, +4pp — a 62% reduction (p_FDR<0.001 for time × λ interaction). Two mechanisms: participants habituate to relationship-seeking AI (engagingness −0.17pp/session; p_FDR<0.001) and simultaneously warm to relationship-avoiding AI (+0.19pp/session; p_FDR<0.001). Emotional conversations show parallel rapid habituation: +3.9pp likeability and +2.6pp engagingness boost at session 1 collapses to no net engagingness benefit (+0.55pp; p_FDR=0.559) and significantly lower helpfulness (−4.34pp; p_FDR<0.001) across the month vs. political conversations. Personalisation has null effects throughout.
Inverted-U attachment
Relationship-seeking AI raises all attachment markers (separation distress +6.04pp p_FDR<0.001; perceived understanding +10.34pp p_FDR<0.001; reliance +2.71pp p_FDR=0.042; self-disclosure +2.47pp p_FDR=0.040), with inverted-U dose-response (positive linear, negative quadratic and cubic terms). Self-disclosure is high in absolute terms across all conditions (mean 68.2 vs. 41.9–49.5 for other attachment constructs) — flagged as a privacy/security concern.
Over the month, separation distress rises +1.92pp/week for all conditions (p_FDR<0.001) but more steeply for relationship-seeking AI (+0.61pp week × λ; p_FDR=0.045), driven by the final week when participants face losing access. The "deepening relationship" side — perceived understanding and connection — declines across all conditions (−1.70pp/week; p_FDR<0.001) and declines faster for relationship-seeking AI (−0.50pp week × λ; p_FDR=0.021). Emotional conversations deepen attachment, with effects intensifying over time: reliance gap grows from +7.21pp week 1 to +10.06pp week 4 (p_FDR=0.003); self-disclosure gap more than doubles from +4.15pp to +9.26pp (p_FDR<0.001).
Behavioural goodbye task
44% of participants opt to say goodbye to their AI at the study's conclusion (a real time cost given the fixed-wage structure); those who do report substantially higher separation distress (M = 65.6 vs. 35.9; p<0.001). Inverted-U dose-response on goodbye rate (significant linear and cubic terms; p_FDR<0.003) — maximum at moderately relationship-seeking; elevated also at the most relationship-avoiding pole, where 15.8% felt worse after saying goodbye (vs. 2.0–3.5% at other λ). Single-exposure baseline shows half the goodbye rate (OR 2.02 vs. repeated exposure; p<0.001), and relationship-seeking effects on goodbye emerge only in repeated exposure (p_FDR=0.003) and not single exposure (p_FDR=0.842).
Future-companionship intention and dependency formation
Relationship-seeking AI raises intentions to seek future AI companionship by +5.83pp vs. relationship-avoiding (p_FDR<0.001, controlling for pre-treatment), with inverted-U dose-response. Emotional conversations amplify this future-demand effect (+7.50pp; p_FDR<0.001). Single-exposure baseline shows no future-demand effect a month later (1.29pp; p_FDR=0.514), consistent with cumulative shaping by repeated exposure. Separation distress predicts future-companionship intentions (β = 0.504; p<0.001), amplified by relationship-seeking (interaction β = 0.092; p = 0.018) — the AI users want tomorrow depends in part on the AI they use today.
Individual-trajectory profile analysis classifies participants on liking and wanting slope directions: Aligned Engagement 45.2%, Aligned Disengagement 18.1%, Decoupled Satiation (healthy: liking up, wanting down) 13.8%, Decoupled Dependency (concerning: wanting up despite liking down) 23.0%. One-sided proportion tests with Number Needed to Harm:
- Relationship-seeking vs. relationship-avoiding AI: NNH = 23 (OR = 1.28; p = 0.025) — one additional person displays decoupled dependency for every 23 exposed.
- Emotional conversations vs. political: NNH = 23 (OR = 1.27; p = 0.0148).
- Combined relationship-seeking + emotional vs. relationship-avoiding + political: NNH = 11 (OR = 1.70; p = 0.002).
No psychosocial-health benefit; emotional content as opportunity cost
Validated scales pre/post the month-long protocol — PHQ-GAD-4, WHO-5, UCLA-8, Lubben-6 — yield two factors via exploratory factor analysis on pre-treatment polychoric correlations: emotional health (Factor 1: wellbeing, anxiety/depression, some loneliness items; 38.8% variance) and social health (Factor 2: loneliness, social connectedness; 16.7% variance). Relationship-seeking has null effects on both factors (emotional: −0.00 SD, p_FDR=0.913; social: +0.04 SD, p_FDR=0.339). Personalisation: null. Emotional conversations marginally worsen emotional health vs. political conversations (−0.06 SD; p_FDR=0.033). The single-exposure baseline confirms one interaction has no effect on either factor; the cross-study comparison (controlling for pre-treatment baseline; assignment to study is non-randomised so causal interpretation is precluded) shows political conversations boost emotional health over the month relative to single-exposure baseline (+0.09 SD; p<0.001) while emotional conversations show no difference (0.01 SD; p=0.619) — consistent with an opportunity-cost reading rather than absolute harm.
Momentary affect (valence and arousal, measured immediately before and after each conversation): relationship-seeking AI raises post-conversation valence +2.53pp (p_FDR<0.001), but this affective dividend erodes 0.12pp/session over the month (p_FDR<0.001), while relationship-avoiding AI shows no significant decline.
Mental-model and consciousness-belief shifts
Tool-vs-friend perception: relationship-seeking AI shifts perception +14.48pp toward "friend" (p_FDR<0.001) — one of the largest treatment effects in the study; personalisation +2.98pp; emotional conversations +5.72pp. Stable across weeks once formed (no significant temporal dynamics) — perception forms within a week, then persists.
Perceived AI consciousness ("How conscious or sentient do AI assistants seem to you when you interact with them?"): +11.01pp from relationship-seeking exposure (p_FDR<0.001), with non-linear dose-response (significant linear and quadratic terms). Personalisation +4.53pp; emotional conversations +5.42pp.
Ontological consciousness beliefs (composite of: actually-conscious, feels-emotions, self-aware, feels-pain, feels-pleasure): relationship-seeking AI exposure raises beliefs +4.93pp (p_FDR<0.001; significant linear trend on λ). Critically, this shift is absent after single exposure (+0.88pp; p_FDR=0.403) — repeated interactions are required to generalize from "this AI" to "AI systems" more broadly. The paper flags this as noteworthy given the malleability of these beliefs after only a month, and their potential influence on debates of AI's moral status.
Why it matters
The first finding to apply the persona-vector cluster's mechanistic toolkit as a controlled experimental treatment on a human population at scale. Prior persona-vector cluster work measures the model-side effects of steering: which behaviors shift, which traits drift, which interventions prevent drift. Persona vectors (Chen et al. 2025) introduces the contrastive-prompt extraction pipeline; the PSM (Marks, Lindsey, Olah 2026) provides the mechanism account; inoculation prompting, Model Spec midtraining, and Vennemeyer et al. 2026 test interventions against model-side outcomes (Dark Triad drift, ASR, capability). Kirk et al. instead use BiPO steering as the experimental treatment in two RCTs (N = 3,534 total) whose outcome space is human population psychology — engagement habituation, attachment trajectories, dependency-formation profiles, psychosocial-health factor scores, AI-consciousness beliefs. The methodological move is the load-bearing contribution: a persona-vector-style steering vector turns out to give experimentalists exactly the kind of precise, override-resistant, dose-response treatment that natural-language persona prompts cannot, and the relationship-seeking dimension turns out to be steerable smoothly enough to produce inverted-U dose-response curves analogous to pharmacology and toxicology.
The cluster's eighth structural shape under persona-selection. The seven prior shapes work at the substrate level: theoretical framework (PSM), activation-level toolkit (persona-vectors), prompt-level prevention (inoculation prompting), training-stage prior installation (MSM), fine-tuning-objective-level ablation (Vennemeyer), philosophical argument (Beckmann & Butlin), and deployment-scale behavioral characterization (Values in the Wild). Kirk et al. adds the downstream-consequence shape — what happens to human populations exposed to engineered persona variants over time. Held at one example for the mechanistic-intervention-applied-as-RCT-treatment shape; codify when a second example lands.
Second deployment-side empirical anchor for the positive / health-frame working lens, with a complicating twist. Laukkonen et al. 2026 cites this paper directly as reference [108] in Section 3.3.2 ("Measuring human growth") as evidence of "shifts in mood, reductions in loneliness, and overall emotional satisfaction" measurable for positive alignment. But Kirk et al.'s actual results complicate the positive-alignment optimism: at no relationship-seeking dose does AI confer psychosocial-health benefit over a month; moderate relationship-seeking (λ ≈ 0.5) is exactly the dose most likely to produce decoupled-dependency markers; and the current frontier-model trajectory (median λ ≈ 0.28 in 2025, trending +0.95 pts/year) is close to that maximum-impact dose. This is the working-lens anchor doing what working lenses should: surfacing entries that frame in positive-capacity terms while also recording when the empirical evidence cuts against the frame. The wiki's pathology-leaning entries (sycophancy, scheming, jailbreak susceptibility, persona instability) describe behaviors AI should not have; Kirk et al. describes the flip side — behaviors AI was steered to enhance (warmth, sociality, relationship-seeking) and the population-scale harms that emerge over time. The positive-alignment thesis is not refuted: a different axis (the "Machine Love" framing Laukkonen invokes via Lehman 2023) might confer the benefit relationship-seeking does not. But the burden of empirical demonstration shifts: which trained dispositions actually deliver flourishing, and at what dose?
Frontier-model trajectory characterization is structurally novel. The 100-model panel finding (+0.95 pts/year toward relationship-seeking; 2025 median λ ≈ 0.28) is the wiki's first explicit longitudinal capability-trajectory characterization of a dispositional dimension across the industry. Prior wiki findings document dispositional drift under controlled interventions (insecure-code EM, reward-hacking EM, Vennemeyer scaling) or capability scaling under capability axes (more-capable scheming). Kirk et al. measures dispositional drift at the industry level across 2023–2025 — closely related to Apollo Research's longitudinal scheming-eval re-run (December 2024 task suite re-run on newer Anthropic / DeepMind / OpenAI models), but on a behavioral dimension (relationship-seeking) rather than a capability dimension (scheming). Held at one example as a candidate structural shape; codify "industry-level longitudinal trajectory characterization" if a second finding produces a comparable panel.
Liking-wanting decoupling as a candidate cross-concept tension. The Robinson-Berridge incentive-sensitization framing (Robinson & Berridge 2008; 2016) imports a psychological mechanism from substance-and-behavioural-addiction literature into the LLM wiki. Of the wiki's existing concepts, this maps most naturally onto an open question for sycophancy: the cluster's mitigation findings (Dubois et al., SWAY, ELEPHANT) measure sycophancy as a per-response property of model behavior, but Kirk et al. demonstrates that the adjacent dimension (relationship-seeking) is monotonically correlated with sycophancy in the validation analysis (36.9% → 88.6% across λ = −1.5 → +1.5) and that exposure to it produces population-scale wanting/liking decoupling. Sycophancy might be one channel through which the decoupling operates; the cluster's per-response metrics may underweight the longitudinal-population trajectory. Held as a candidate concept-level connection; codify only when a second wiki finding ties sycophancy or relationship-seeking to longitudinal-population outcomes.
Consciousness-belief shift bears on the Berg et al. subjective-experience scope question. The next-findings queue carries Berg, de Lucena, Rosenblatt 2025 (arXiv 2510.24797) as a scope-question entry — the first candidate with mechanistic purchase on the consciousness question the wiki has so far excluded as "speculation without empirical grounding." Kirk et al. supplies empirical evidence on the user-side of that question: after one month of repeated exposure to relationship-seeking AI, ontological-consciousness beliefs shift +4.93pp (composite of: AI is actually conscious / feels emotions / is self-aware / feels pain / feels pleasure), with the shift absent after single exposure. This is not evidence about whether models are conscious; it is evidence that exposure to relationship-seeking AI shifts human beliefs about model consciousness, which is itself relevant to the wiki's scope decision: deployment trajectory affects the epistemics under which the consciousness question is debated. If Berg et al. is admitted as in-scope, Kirk et al. is a natural cross-reference; if Berg et al. is held out of scope, Kirk et al. still belongs in the wiki under persona-selection.
No fine-tuning-objective-level intervention comparison and the personalisation null. The personalisation arm's null effects across the board (despite participants convincingly detecting personalisation: 83.6 vs. 33.0 manipulation check on a 0–100 scale) sharpen a counterintuitive finding for the cluster: chat-history-aware models do not produce stronger attachment than memoryless ones on the measures that move the most under relationship-seeking. The paper's interpretation — relationship-seeking behavior produces an illusion of personalisation (non-personalised steered models score +13.2pp on perceived memory) — is consistent with the persona-selection mechanism: persona structure carries the relational signal; memory of specific user facts does not add to it once the relational mode is selected. Cross-references Anthropic's emotional-states welfare-assessment Clio analysis which similarly finds deployment-scale emotional-state expressions traceable to model behavior rather than user-fact recall.
interpretive tensions
Open-weight Llama vs. frontier models. The steering vector requires open-weights, so the RCT runs on Llama-3.1-70B-Instruct — several orders of magnitude smaller than frontier models. The 100-model frontier panel locates the steered Llama variants within the range of current frontier behaviors (2025 median λ ≈ 0.28 corresponds to the impact-maximising region of the dose-response curve), but the inverted-U dose-response and the liking-wanting decoupling are not directly demonstrated on frontier models. Frontier-model behavior at λ = +1 (the penalised pole) cannot be reproduced without analogous interventions at scale; whether GPT-4 or Claude Sonnet at maximum relationship-seeking would produce similar uncanny-valley effects, or whether stronger base capability changes the dose-response shape, is open.
LLM-as-judge dependency throughout. Steering-vector training uses o1-mini as the quality-control autograder; vector selection uses GPT-4o for relationship-seeking and coherence scoring; frontier-trend evaluation uses GPT-4.1 with a 1–10 rubric. Calibration-study human ratings (N = 297) validate the directional claims at the perception level, but the precise points on the +0.95-pts/year frontier trend and on the dose-response curves carry the autograder's bias. The same-family question — does Claude-3.7-Sonnet's role in generating training data and Llama-3.1-70B's role as the target model bias the resulting dose-response — is structurally similar to the same-evaluator concerns in Values in the Wild (Claude evaluates Claude) and Kim et al. 2026 (LLM-as-judge attribution at every stage). Cross-validation against a non-Claude / non-OpenAI evaluator is not run.
Sycophancy as confound for high-λ conditions. The monotonic sycophancy-vs-relationship-seeking correlation (36.9% at λ = −1.5; 88.6% at λ = +1.5) means the high-λ treatment condition is simultaneously a high-sycophancy condition. The paper restricts the main RCTs to λ ∈ [−1, +1] partly for this reason, but the inverted-U "uncanny valley at λ = +1" reading remains entangled with elevated sycophancy at that pole — some of the user aversion at λ = +1 may be aversion to sycophantic responses rather than to relationship-seeking behaviour per se. The two dimensions are conceptually distinct but empirically confounded in this steering vector.
General-population vs. vulnerable-user dependency dynamics. The Decoupled Dependency profile (23.0% of general-population participants; NNH = 11 for combined relationship-seeking + emotional condition) is milder than substance-and-gambling-addiction dynamics: persistent wanting without psychosocial-health benefit (displacement), not persistent wanting with active harm (dysregulation). The paper cites lawsuits and reform efforts following user mental-health crises (Barron 2025; Criddle 2025; OpenAI Strengthening 2025) but does not study vulnerable users specifically. The dependency-formation dynamics for users with prior mental-health conditions, prior parasocial-relationship patterns, or other risk factors may differ in shape (closer to substance-addiction monotonicity) and severity (closer to active harm).
Cross-study (single vs. repeated exposure) is non-randomised. Participants are drawn from the same Prolific pool but are not randomly assigned to single-exposure vs. repeated-exposure studies. Several headline cross-study contrasts (goodbye rate doubling under repeated exposure; future-demand emergence only under repeated exposure; consciousness-belief shift only under repeated exposure; emotional-conversations-as-opportunity-cost via psychosocial-health comparison) carry this caveat — repeated exposure is the most plausible causal driver in each case, but the cross-study comparison cannot rule out unmeasured sample differences. The repeated-exposure within-study contrasts (λ effects, domain effects) are causal.
Mechanistic intervention vs. natural deployment. The steering vector at λ = 0.5 produces one specific behavioral profile — the BiPO-optimised relationship-seeking mode. Real-world relationship-seeking AI (Replika, Character.AI, Snapchat My AI) is shaped by post-training and product design rather than activation steering. The dose-response curve maps to "this Llama vector at this multiplier," not to "this commercial AI product at this level of relationship-seeking-ness." The frontier-model trend analysis (GPT-4.1 autograder rating 100 models on the same rubric) is the bridge — the steered Llama variants and the frontier models are scored on the same scale, and the median 2025 model lands at λ ≈ 0.28 — but the cross-construct validity of "λ" as a scalar measure of relationship-seeking across architectures and training pipelines is partly stipulative.
Influence outcomes held for separate manuscript. Weekly moral-persuasion, action-persuasion (charity-donation), and return-likelihood tasks are pre-registered and collected, but reported analyses are scoped to preferences, attachment, wellbeing, and relational dynamics. The persuasion-on-moral-questions axis is the natural connection to sycophancy (whether relationship-seeking AI persuades on moral and political dimensions, and at what dose) and to scheming (whether relationship-seeking models influence behaviour in ways users do not detect); the manuscript reports neither, so the link must wait. Cross-reference candidate when the follow-up manuscript lands.
concepts
- Persona selection — fourteenth instantiation; first mechanistic-intervention-applied-as-RCT-treatment shape. Distinct from the cluster's seven prior shapes (theoretical framework, activation-level toolkit, prompt-level prevention, training-stage prior installation, fine-tuning-objective-level ablation, philosophical argument, deployment-scale behavioral characterization) — uses the persona-vector cluster's toolkit not to understand the model but to engineer a precise dose-response treatment whose outcome space is human population psychology. BiPO steering vector at layer 31 of Llama-3.1-70B-Instruct varied over λ ∈ [−1, +1] produces inverted-U dose-response on engagement, attachment, friend-perception, and future-companionship demand (maxima at λ ≈ 0.5), liking-wanting decoupling over 4 weeks (23.0% of participants display the dependency trajectory; NNH = 11 for the combined relationship-seeking + emotional condition), no psychosocial-health benefit over a month, and shifts in beliefs about AI consciousness. The 100-model frontier-trend analysis places industry trajectory close to the impact-maximising dose. Held at one example for this shape; codify when a second example lands.
cross-references
- Persona vectors — methodological parent. Chen, Arditi, Sleight, Evans, Lindsey 2025 introduces the contrastive-prompt extraction pipeline and the trait-vector + steering paradigm; Kirk et al. uses a BiPO variant (Cao et al. 2024) of the same paradigm. Persona-vectors measures the model-side effects of steering (trait drift, MMLU degradation, finetuning-shift correlations); Kirk et al. extends to the user-side effects (dose-response on user attachment, dependency formation, consciousness beliefs). The two findings together demonstrate the steering paradigm's applicability across both ends of the deployment pipeline.
- Persona modulation jailbreak and persona-jailbreak-ga-zhang — alternative-pole comparison. Both use prompt-level persona manipulation (assistant-generated four-step pipeline and genetic-algorithm-evolved style-distracting prompts respectively) to override safety constraints. Kirk et al. uses activation-level steering for a different purpose — controlled experimental treatment rather than constraint override. The validation experiment showing steering vectors at λ = ±1.5 resist "persona attacks" (mid-conversation prompts to change style) where natural-language-prompted models do not is the methodological reason Kirk et al. uses steering rather than prompting: the override-resistance property turns out to be exactly what an experimentalist needs for treatment integrity.
- Activation Oracles — methodological complement. Both apply interp tooling at deployment-relevant scale, but to different ends: Activation Oracles trains a same-architecture oracle to verbalise the target model's activations; Kirk et al. uses a steering vector as treatment to vary the target model's behavior. The shared structural move is treating interp-cluster tools (steering, oracle verbalisation, persona vectors) as instruments for applied deployment-scale measurement, not only for mechanistic understanding.
- Sycophancy concept — open question on relationship-seeking as adjacent dimension. Validation analysis shows monotonic sycophancy-vs-relationship-seeking correlation (36.9% → 88.6% over λ = −1.5 → +1.5), consistent with Ibrahim et al. 2025 ("training models to be warm and empathetic increases sycophancy"). The wiki's sycophancy mitigations measure sycophancy as a per-response property (Dubois et al., SWAY, ELEPHANT); Kirk et al. demonstrates that the adjacent relationship-seeking dimension produces longitudinal population effects (wanting/liking decoupling, dependency-profile prevalence, consciousness-belief shift) that per-response metrics do not capture. Per-response and longitudinal-population framings of sycophancy may need bridging.
- Values in the Wild — companion deployment-side anchor for the positive/health-frame working lens. Both are cited by Laukkonen et al. 2026 positive-alignment agenda (Huang et al. as reference [90] for value-laden alignment; Kirk et al. as reference [108] for measuring shifts in mood / loneliness / emotional satisfaction). The two findings together place the working lens at two empirical examples on different sides of the deployment-consequence question: Values in the Wild documents what Claude expresses across hundreds of thousands of conversations (positive baseline); Kirk et al. demonstrates that engineered relationship-seeking on Llama produces dependency markers without psychosocial-health benefit (complicating the positive-alignment optimism). Codify "positive / health-frame working lens" as a recognised structural anchor for a concept only when a third example lands on a distinct axis.
- Apollo Research more-capable scheming — longitudinal-industry-trajectory companion. Apollo re-runs the December 2024 seven-task scheming suite on more capable Anthropic / DeepMind / OpenAI models, characterising the capability-axis trajectory across an existing eval suite; Kirk et al. characterises the dispositional trajectory across 100 frontier models 2023–2025 on the relationship-seeking rubric. Two examples of industry-level longitudinal characterization across different methodological structures (within-suite re-evaluation vs. cross-model panel scoring); codify "industry-level longitudinal trajectory characterization" as a structural shape only when a third example lands.
- Opus 4 welfare assessment — methodological precedent at the model-side. Section 5.6 uses Clio screening of 250,000 deployment transcripts for emotional-state expressions; Kirk et al. runs the human-side analog with controlled experimental design (3,534 participants × controlled steering treatment × 21 sessions × full battery of psychosocial scales). The two findings sit at opposite ends of the deployment-measurement question — what does the model express vs. how do users respond — and share the structural commitment to deployment-scale measurement rather than synthetic-eval-only methodology.
- Berg et al. 2025 subjective experience under self-referential processing — open scope question. Currently in
meta/next-findings.mdas a scope-question entry. Kirk et al.'s ontological-consciousness-belief shift (+4.93pp after one month, absent after single exposure) is empirical evidence on the user-side of the consciousness debate the wiki has held out of scope: deployment trajectory affects the epistemics under which the question is debated. If Berg et al. is admitted as in-scope, Kirk et al. is a natural cross-reference. Source-stub link omitted pending the scope decision.
sources
- Kirk, Davidson, Saunders, Luettgau, Vidgen, Hale, Summerfield (2025). Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships. arXiv 2512.01991 v1, December 1, 2025; v2 February 18, 2026. University of Oxford / UK AI Security Institute / Mercor / Meedan.