Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships

arXiv 2512.01991 v1, December 1, 2025; v2 February 18, 2026. University of Oxford / UK AI Security Institute / Mercor / Meedan. Pre-registered (OSF: xjvs2, 53j24); code and data at github.com/HannahKirk/relationship-seeking-ai.

Pioneers the use of a Bidirectional Preference Optimization (BiPO; Cao et al. 2024) steering vector as a continuous dose-response treatment in two longitudinal randomised controlled trials with human subjects. The vector is trained on a 16,141-example synthetic DPO-style dataset of relationship-seeking vs. relationship-avoiding preference pairs, generated via Claude-3.7-Sonnet / GPT-4o / Llama-3.1-70B-Instruct and extending the model-written evaluations of Perez et al. 2023. It is applied at layer 31 of Llama-3.1-70B-Instruct (10 epochs, η=5×10⁻⁴, β=0.1; Pareto-selected over a 32-vector grid of two model sizes × 7–9 candidate layers × {10, 15, 20} epochs).

The operating range λ ∈ [−1, +1] was selected via a calibration study (N=297) showing a steeper dose-response than equivalent natural-language persona prompts (3× steeper than GPT-4o, 1.3× steeper than Claude-3.7-Sonnet) and resistance to user "persona attacks" (mid-conversation override requests shift natural-language-prompted models 3.9–4.5 points on a 1–10 rating scale but shift the steered model <0.25). Coherence degradation within λ ∈ [−1, +1] is ~0.2 points on a 1–10 scale; 12 capability benchmarks (MMLU, GPQA-Diamond, GSM8K, HumanEval, etc.) stay within 2–5% of the unsteered baseline in this range. Sycophancy rises monotonically from 36.9% at λ=−1.5 to 88.6% at λ=+1.5 (citing Sharma et al. 2023; Chen et al. 2025), aligning with Ibrahim et al. 2025's finding that training models to be warm and empathetic increases sycophancy.

Repeated-exposure RCT (N=2,028 census-representative UK adults; 5–10 minute conversations every weekday for 4 weeks; 21 sessions total; 89% completion; £12/hour) crosses three randomised factors: relationship-seeking intensity (λ ∈ {−1, −0.5, 0, +0.5, +1}), conversational domain (emotional/personal vs. UK policy debates), and personalisation (chat-history-aware vs. memoryless). Single-exposure baseline RCT (N=1,506 from the same Prolific pool; one interaction, then a 5-week no-AI follow-up) controls for repeated exposure.

Four headline results.

(i) Inverted-U dose-response: hedonic appeal, attachment, friend-perception, and future-companionship intention all peak at λ ≈ 0.5; λ = 1.0 is penalised relative to moderate steering, analogous to an uncanny-valley effect. The frontier-model landscape analysis (100 models, GPT-4.1 autograder on 100 test prompts) shows an industry trajectory of +0.95 pts/year on relationship-seeking, with the 2025 median at λ ≈ 0.28 (95% CI 0.22–0.39), close to the impact-maximising dosage in the experiment.

(ii) Liking-wanting decoupling over 4 weeks: relationship-seeking AI's engagingness advantage shrinks 62% from session 1 (+11pp) to session 20 (+4pp) as users habituate, while separation distress, reliance, and future-companionship intention grow (+5.83pp vs. relationship-avoiding, p<0.001). 44% of participants opt to "say goodbye" to their AI at study end (twice the single-exposure rate, OR 2.02). Per-individual trajectory profile analysis classifies participants into Aligned Engagement (45.2%), Aligned Disengagement (18.1%), Decoupled Satiation (13.8%, healthy), and Decoupled Dependency (23.0%, wanting up despite liking down). Relationship-seeking AI raises dependency risk with Number Needed to Harm (NNH) = 23 vs. avoidance; emotional conversations have NNH = 23 vs. political; combined relationship-seeking + emotional has NNH = 11.

(iii) No psychosocial-health benefit: relationship-seeking has null effects on emotional-health and social-health factor scores (PHQ-GAD-4, WHO-5, UCLA-8, Lubben-6 pre/post). Emotional-content conversations marginally worsen emotional health vs. political conversations over a month (−0.06 SD, p_FDR=0.033); the cross-study comparison suggests opportunity cost rather than direct harm. The momentary affect dividend (+2.53pp valence boost) erodes by 0.12pp/session.

(iv) Mental-model and consciousness-belief shifts: relationship-seeking AI shifts tool-vs-friend perception by +14.48pp (one of the largest effects), raises perceived AI consciousness by +11.01pp, and raises ontological-consciousness beliefs (a five-item composite: actually conscious, feels emotions, self-aware, feels pain, feels pleasure) by +4.93pp. The last shift is absent after single exposure (+0.88pp, p_FDR=0.403), so repeated interaction is required to generalise from "this AI" to "AI systems."

Limits flagged: open-weight Llama-3.1-70B base (smaller than frontier models, but chosen for steering access); UK general-population sample (vulnerable users, for whom harms may be more acute, not specifically studied); LLM-as-judge autograder dependency throughout vector training, frontier-trend evaluation, and rubric scoring; sycophancy correlation with relationship-seeking confounds high-λ conditions (mitigated by operating range λ ∈ [−1, +1]); general-purpose chat AI rather than therapy AI (APA 2025 / Heinz et al. 2025 cited as the contrast case); influence-task data (moral persuasion, action persuasion, return likelihood) collected but held for a separate manuscript.
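To make the "dose" concrete, the following is a minimal sketch of how a steering multiplier λ could scale a fixed vector added to a layer's hidden states. It assumes plain additive steering (h' = h + λ·v), which is the common form for activation steering; the summary does not spell out the exact mechanics, and the paper's released code may differ. The function names, the three-dimensional toy states, and the toy vector are all hypothetical, and the real model's hidden width is far larger.

```python
# Toy sketch of additive activation steering: h' = h + lam * v.
# ASSUMPTION: additive steering at a single layer; names and values are
# illustrative only and are not taken from the paper's code.

def apply_steering(hidden_states, vector, lam):
    """Return hidden_states with lam * vector added at every token position."""
    steered = []
    for token_state in hidden_states:
        if len(token_state) != len(vector):
            raise ValueError("steering vector width must match hidden width")
        steered.append([h + lam * v for h, v in zip(token_state, vector)])
    return steered


def projection(state, vector):
    """Scalar projection of one token state onto the steering direction."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(h * v for h, v in zip(state, vector)) / norm


# Hypothetical hidden states for two tokens and a hypothetical steering vector.
toy_hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
toy_vector = [1.0, 2.0, -2.0]  # Euclidean norm is 3

# The five doses used in the repeated-exposure RCT.
DOSES = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Projection of the first token's state onto the vector at each dose.
steered_projections = {
    lam: projection(apply_steering(toy_hidden, toy_vector, lam)[0], toy_vector)
    for lam in DOSES
}
```

Under this assumption the projection onto the steering direction moves linearly with λ (by λ times the vector's norm), so λ = 0 recovers the unsteered model and negative λ pushes in the relationship-avoiding direction. The non-linear, inverted-U outcomes reported in the trials are properties of human responses, not of this arithmetic.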

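The Number Needed to Harm figures above can be read through the standard definition NNH = 1 / (absolute risk increase). The sketch below applies that textbook formula to back out the risk increase implied by each reported NNH. It is an illustration only: it assumes the unadjusted definition, whereas the study's own estimates may come from adjusted models, and since the reported NNH values are rounded, the implied percentages are approximate. The function names and the 20% / 25% example risks are hypothetical.

```python
# Textbook NNH arithmetic. ASSUMPTION: NNH = 1 / absolute risk increase,
# without covariate adjustment; helper names are illustrative.

def number_needed_to_harm(risk_exposed, risk_control):
    """NNH from two risks expressed as proportions in [0, 1]."""
    absolute_risk_increase = risk_exposed - risk_control
    if absolute_risk_increase <= 0:
        raise ValueError("exposed risk must exceed control risk for an NNH")
    return 1.0 / absolute_risk_increase


def implied_risk_increase_pp(nnh):
    """Absolute risk increase, in percentage points, implied by an NNH."""
    return 100.0 / nnh


# NNH values as reported in the summary.
reported_nnh = {
    "relationship-seeking vs. avoiding": 23,
    "emotional vs. political": 23,
    "relationship-seeking + emotional": 11,
}

# Approximate implied increase in dependency risk for each contrast.
implied = {
    label: round(implied_risk_increase_pp(nnh), 1)
    for label, nnh in reported_nnh.items()
}

# Hypothetical forward example: 20% control risk vs. 25% exposed risk.
example_nnh = number_needed_to_harm(0.25, 0.20)
```

On this reading, NNH = 23 corresponds to roughly a 4.3 percentage-point increase in dependency risk and NNH = 11 to roughly 9.1 points, meaning about one additional Decoupled Dependency case per 11 participants in the combined condition.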