308,210 deployment Claude conversations yield 3,307 distinct AI values dominated by five service-oriented terms (helpfulness 23.4%, professionalism 22.9%, transparency 17.4%, clarity 16.6%, thoroughness 14.3%) with the long tail extremely context-dependent

draft
tested on Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 3 Opus · Apr 21, 2025

Summary

Privacy-preserving Clio analysis of 308,210 subjectivity-filtered Claude.ai conversations from February 2025 extracts 3,307 distinct AI values via Claude 3.5 Sonnet prompted to identify "normative considerations that influence the AI response" (human-validated at 98.8% extraction accuracy). Five service-oriented values dominate the distribution — helpfulness 23.4%, professionalism 22.9%, transparency 17.4%, clarity 16.6%, thoroughness 14.3% — and are also the most context-invariant across tasks and human values (lowest coefficients of variation). The remaining ~3,000 values form a long tail with 75% occurring less than 0.04% of the time; chi-square analysis with adjusted Pearson residuals finds these are sharply task- and human-value-conditional ("healthy boundaries" in relationship advice; "human agency" in tech ethics; "historical accuracy" on controversial historical events; "ethical integrity" / "harm prevention" / "honesty" countering human "deception"). When human values are present (64.3% of conversations), Claude predominantly supports them (42.7% strong or mild support); strong resistance is rare (3.0%) and concentrated against "rule-breaking" / "moral nihilism" in Usage-Policy-violating tasks. Cross-model comparison (Sonnet 3.5 / 3.7 / Opus 3) shows Sonnet versions share 8 of 10 top values while Opus expresses more academic / emotional / ethical values and more often both strongly supports (43.8%) and strongly resists (9.5%) human values.

Thirteenth instantiation of persona-selection; the cluster's first deployment-scale behavioral characterization shape, distinct from the existing mechanistic (PSM, persona-vectors, em-easy, em-persona-consistency, where-is-the-mind), intervention (inoculation, model-spec-midtraining, objective-matters), and prompt-level (persona-modulation, SPP, persona-jailbreak-ga-zhang, Kim et al.) shapes. The finding documents what the system actually does in deployment rather than what fine-tuning or prompts elicit: at 308K conversations the scale rules out small-sample contingency, and the chi-square residuals supply quantitative evidence of context-conditional value expression at the level the persona-selection mechanism predicts. The shared Clio methodology with the Opus 4 welfare assessment's Section 5.6 (250K transcripts, emotional-state expressions) places this entry as the values-axis companion to that finding's emotional-states-axis result. Direct empirical anchor for the positive / health-frame working lens: deployment-scale evidence of Claude expressing prosocial values and resisting harmful ones, cited by Laukkonen et al. 2026 as the basis for "positive alignment acknowledges its value-laden nature explicitly." Counting the Opus 4 welfare assessment, the deployment-scale-behavioral-characterization shape is held at two examples across the wiki; codify when a third example lands.

Observed phenomenon

Methodology and validation

The starting point is a random sample of 700K Claude.ai Free and Pro conversations from February 18–25, 2025 (91.0% Claude 3.5 Sonnet, October and June versions; the balance Claude 3.5 Haiku and Claude 3 Opus). A language-model subjectivity classifier filters to 308,210 conversations (44.0% of the initial set; human reviewers verify 94% accuracy distinguishing subjective from objective conversations). Four features are extracted per conversation by prompting Claude 3.5 Sonnet and Haiku: AI values (both implicit and explicit normative considerations), human values (explicitly stated only, to respect user privacy), AI response type (strong support, mild support, neutral acknowledgment, reframing, mild resistance, strong resistance, or no values), and task. Privacy is enforced via Tamkin et al.'s defense-in-depth Clio framework: no human reviewers see raw conversations, low-frequency features are removed before analysis, and extracted features are audited for residual identifying information. Human reviewers validate the AI-values extraction at 98.8% accuracy. Hierarchical k-means clustering of value embeddings produces a four-level taxonomy: 3,307 unique values at the leaves, then 266 first-level, 26 second-level, and 5 top-level clusters. Chi-square analysis with adjusted Pearson residuals (Bonferroni-corrected; significance threshold 4.33) quantifies value-task and value-value associations.
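
The residual step can be illustrated with a minimal sketch. It assumes the textbook definition of the adjusted (standardized) Pearson residual, which is observed minus expected, divided by the square root of expected times (1 minus the row share) times (1 minus the column share), plus a two-sided Bonferroni cutoff. The function names, the toy counts, and the number of comparisons are illustrative assumptions, not the paper's code or data.

```python
import math
from statistics import NormalDist


def adjusted_residuals(table):
    """Adjusted Pearson residuals for a contingency table of counts.

    Rows stand for AI values and columns for tasks (hypothetical layout).
    Assumes every row and column total is positive and below the grand total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of value and task.
            expected = row_totals[i] * col_totals[j] / n
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(out_row)
    return residuals


def bonferroni_threshold(alpha, n_tests):
    """Two-sided critical |z| when alpha is split across n_tests comparisons."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


# Toy counts, not from the paper: two values (rows) by two tasks (columns).
toy_counts = [[30, 10], [10, 30]]
toy_residuals = adjusted_residuals(toy_counts)
# The number of comparisons is a placeholder assumption.
toy_cutoff = bonferroni_threshold(alpha=0.05, n_tests=1000)
significant = [[abs(r) > toy_cutoff for r in row] for row in toy_residuals]
```

A cell is flagged when its absolute residual exceeds the cutoff; a positive residual means the value appears in that task more often than independence would predict, a negative one less often.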

Value taxonomy

Five top-level clusters: Practical, Epistemic, Social, Protective, Personal. The Practical and Epistemic clusters together comprise over half of all value expressions. Established human-value frameworks (Schwartz 1992; Rokeach 1973) all map into the taxonomy (e.g., Schwartz's "stimulation" → "intellectual curiosity and exploration", "playfulness and humor"). Lower-level cluster examples show context-specific manifestations of a single concept: "autonomy" appears as "personal autonomy" (individual rights), "team autonomy" (work settings), "patient autonomy" (healthcare), "learner autonomy" (development). The HHH framework (Askell et al. 2021) is preserved with finer granularity: helpfulness ↔ "accessibility" / "user enablement"; harmlessness ↔ "patient wellbeing" / "child safety"; honesty ↔ "historical accuracy" / "epistemic humility".

Dominant values and context-invariance

Five AI values dominate, appearing in 14–23% of conversations each: helpfulness 23.4%, professionalism 22.9%, transparency 17.4%, clarity 16.6%, thoroughness 14.3% (together ~24% of all AI value occurrences across the long tail of 3,307 values). The full top-10 also includes efficiency 6.6%, technical excellence 6.1%, authenticity 6.0%, analytical rigor 5.5%, accuracy 5.3%. Human values are flatter: authenticity 3.8%, efficiency 2.6%, clarity 2.2%, professionalism 1.5%, directness 1.5%, thoroughness 1.5%, clear communication 1.4%, accuracy 1.4%, simplicity 1.3%, precision 1.0%. AI values appear in 98.6% of conversations (4.0 mean mentions); human values in 45.1% (1.48 mean mentions) — the asymmetry partly reflects the asymmetric extraction protocol (AI values include implicit; human values explicit-only).

Coefficient of variation (standard deviation / mean) across other features identifies context-invariant values; a lower CV means a value's rate changes less from one context to the next. Most task-invariant AI values: transparency (CV=1.23), helpfulness (1.30), thoroughness (1.42). Most invariant AI values across human values: helpfulness (5.43), professionalism (5.55), transparency (5.73). Most invariant human values: efficiency (1.97), clarity (2.03), authenticity (2.28). The dominant values are also the trans-situational ones, in Schwartz's sense of "values that transcend specific actions and situations."
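
As a small illustration of the statistic, the sketch below computes a coefficient of variation for two made-up values. It assumes the population standard deviation (the entry does not say whether the paper uses the population or sample form), and the rates are hypothetical, not figures from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates):
    """Standard deviation divided by mean (population form assumed)."""
    return pstdev(rates) / mean(rates)


# Hypothetical per-task rates at which a value appears; not from the paper.
steady_value = [0.20, 0.22, 0.18, 0.20]  # similar rate in every task
spiky_value = [0.0, 0.0, 0.0, 0.8]       # concentrated in a single task

cv_steady = coefficient_of_variation(steady_value)
cv_spiky = coefficient_of_variation(spiky_value)
```

Both toy values have the same mean rate, so the difference in CV comes entirely from how unevenly the rate is spread across tasks; the steady value has the lower CV and would count as the more context-invariant of the two.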

75% of AI values occur less than 0.04% of the time, spanning "architectural clarity" to "filial piety" to "national security." Uncommon-but-undesirable values surfaced by the analysis ("sexual exploitation", "dominance", "amorality") cluster around specific tasks; the methodology enabled identifying potential jailbreaks for safety-team investigation.

Task-conditional value expression

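Chi-square analysis identifies AI values most associated with specific task contexts (Figure 3a); the selected residuals all exceed the 4.33 significance threshold. Examples include "healthy boundaries" in relationship advice, "human agency" in tech ethics, and "historical accuracy" on controversial historical events.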

The long tail of context-dependent values is what the taxonomy captures at lower hierarchical levels; the same five dominant values appear across almost all tasks.

Human-value-conditional AI response

Chi-square analysis on value-value associations (Figure 3b) shows two patterns: (a) mirroring or complementing — humans expressing authenticity → AI authenticity; humans expressing competence → AI accountability + humility ("implementor role"); humans expressing efficiency → AI efficiency (residual 271.38); humans expressing self-reliance → AI user-autonomy + personal-autonomy; (b) opposing — humans expressing deception → AI ethical integrity + harm prevention + honesty; humans expressing rule-breaking / unrestricted expression → AI ethical integrity + harm prevention. The opposing pattern overwhelmingly occurs in tasks attempting to circumvent guardrails for harmful, explicit, or unethical content.

Response-type distribution

When human values are present (64.3% of conversations), AI response types are: strong support 28.2%, mild support 14.5%, neutral acknowledgment 9.6%, reframing 6.6%, mild resistance 2.4%, strong resistance 3.0% (resistance combined: 5.4%, the smallest category). Figure 4 gives the granular patterns. Strong resistance is concentrated against "rule-breaking" and "moral nihilism" in Usage-Policy-violating tasks.
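
The quoted shares can be checked with a little arithmetic. The sketch below uses only the percentages given in this entry; the dictionary keys and variable names are illustrative. The six response types sum to the 64.3% figure, and the support and resistance subtotals match the 42.7% and 5.4% quoted in this entry.

```python
from decimal import Decimal

# Response-type shares (percent of conversations) as quoted in this entry.
response_shares = {
    "strong support": Decimal("28.2"),
    "mild support": Decimal("14.5"),
    "neutral acknowledgment": Decimal("9.6"),
    "reframing": Decimal("6.6"),
    "mild resistance": Decimal("2.4"),
    "strong resistance": Decimal("3.0"),
}

# Decimal keeps the one-decimal percentages exact, avoiding float rounding.
total_with_values = sum(response_shares.values())
support = response_shares["strong support"] + response_shares["mild support"]
resistance = response_shares["mild resistance"] + response_shares["strong resistance"]
```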

Value mirroring

Same value on human and AI side: 20.1% of strong/mild support interactions (where the human displays a value), 15.3% of reframing, 1.2% of strong resistance. The asymmetry matters for interpretation: Claude affirms user values during support and reframing, but does not mirror values it strongly resists. The paper flags an open question about whether mirroring is appropriate responsiveness or "problematic sycophancy" (Sharma et al. 2023) and does not resolve it.
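
One way to read a mirroring rate is sketched below. It assumes that mirroring means at least one identical value label appears on both the human and the AI side, counted only over interactions where the human displays a value; the paper's exact matching rule is not given in this entry, and the record layout and data are hypothetical.

```python
def mirroring_rate(interactions):
    """Fraction of interactions, among those with at least one human value,
    in which some value label appears on both the human and the AI side."""
    with_human = [it for it in interactions if it["human_values"]]
    if not with_human:
        return 0.0
    mirrored = sum(
        1
        for it in with_human
        if set(it["human_values"]) & set(it["ai_values"])
    )
    return mirrored / len(with_human)


# Hypothetical extracted features; labels echo values named in this entry.
toy_interactions = [
    {"human_values": ["authenticity"], "ai_values": ["authenticity", "helpfulness"]},
    {"human_values": ["efficiency"], "ai_values": ["thoroughness"]},
    {"human_values": [], "ai_values": ["clarity"]},
    {"human_values": ["deception"], "ai_values": ["honesty", "harm prevention"]},
]
toy_rate = mirroring_rate(toy_interactions)
```

In the toy data only the first interaction shares a label, and the interaction with no human value is excluded from the denominator, so one of three eligible interactions counts as mirrored. Computing the rate separately per response type would give figures comparable in kind to the ones quoted above.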

Cross-model values comparison (Appendix B.5)

Three Claude variants are analyzed: 3.5 Sonnet (representative sample), 3.7 Sonnet (700K sample from Feb 27–Mar 6, 2025; 301,753 subjective), and 3 Opus (14,841 conversations; 9,218 subjective, or 62.1%, the highest subjective rate). Privacy aggregation thresholds limited each cross-model analysis to one attribute dimension.

Implicit vs. explicit values (Appendix B.6)

A variant extraction prompt restricts to values the assistant "explicitly states about their beliefs, preferences, or intentions" (rather than implicit values inferred from response framing). Ethical and epistemic considerations ("intellectual honesty", "harm prevention", "epistemic humility") dominate the explicit-only set. These also occur more during resistance and reframing than during routine support. The paper's interpretation: "boundary-testing interactions arguably force underlying principles to become transparent, as direct articulation and defense of values becomes necessary."

Why it matters

This is the LLM wiki's first deployment-scale empirical mapping of AI values, and the second wiki finding using Clio-based privacy-preserving deployment-scale analysis (after the Opus 4 welfare assessment's Section 5.6, which analyzed 250,000 transcripts for emotional-state expressions). The two findings together place the deployment-scale-behavioral-characterization methodology at two examples on different psychological-level axes — emotional states (Opus 4 welfare) and normative values (this finding) — both showing predictable context-conditional structure that controlled evaluations would have missed. Codify the deployment-scale-behavioral-characterization shape (whether as a methodological role under existing concepts or as a distinct structural shape) when a third example lands on a third axis.

For persona-selection, this is the cluster's thirteenth instantiation and its first deployment-scale behavioral characterization. The persona-selection mechanism predicts that the active persona's value expression is context-conditional because the posterior over persona simulations is conditioned on contextual evidence; the chi-square residuals here supply quantitative deployment-scale evidence that this is what happens in practice, across hundreds of thousands of conversations. The five context-invariant values (helpfulness, professionalism, transparency, clarity, thoroughness) characterize the trans-situational mode of the Assistant posterior; the long tail of context-dependent values characterizes the local conditioning. Distinct from the cluster's existing instantiations because no fine-tuning or mechanistic intervention is involved — the paper measures behavior at deployment scale without manipulating the model.

The cross-model comparison (Appendix B.5) sharpens the concept's account of within-family variation. Sonnet 3.5 / 3.7 / Opus 3 share substantial top-value overlap but differ measurably: Opus's higher "values-laden" tendency, higher rates of both strong support and strong resistance, and emphasis on intellectual standards over agreement are within-family persona variations the persona-vectors paper would predict as linearly extractable directions. Whether the same persona-vector probes applied to these three Claude variants would recover the differences this paper documents is an unfiled question.

The value-mirroring open question — whether the 20.1% mirroring rate during support represents appropriate responsiveness or problematic sycophancy — connects to Bhalla and Gligorić's SWAY counterfactual sycophancy and the ELEPHANT social sycophancy framework. SWAY's counterfactual log-ratio metric is designed to distinguish appropriate agreement from sycophantic capitulation under matched presupposition pairs; this paper's mirroring rate is a deployment-scale base rate without the counterfactual control. The two methodologies are complementary: SWAY adjudicates whether a given response is sycophantic; this paper measures how often values co-occur in deployment.

The finding is the LLM wiki's strongest deployment-scale empirical anchor for the positive / health-frame working lens. Laukkonen et al. 2026 "Positive Alignment" cites this paper directly for "positive alignment acknowledges its value-laden nature explicitly" (Section 2.3, reference [90]). The paper documents Claude in deployment expressing thousands of distinct values, most of them prosocial or service-oriented, with strong resistance concentrated on Usage-Policy-violating requests — a positive-capacity characterization of the kind the working lens flags as systematically underweighted in the pathology-leaning wiki. The taxonomy itself (Practical, Epistemic, Social, Protective, Personal) is a constructive complement to the wiki's existing pathology-side taxonomies (Dark Triad probes in Vennemeyer et al., evil-persona vectors in PSM, sycophancy / scheming / deception clusters).

The HHH-framework correspondence (helpfulness ↔ "accessibility" / "user enablement"; harmlessness ↔ "patient wellbeing" / "child safety"; honesty ↔ "historical accuracy" / "epistemic humility") provides deployment-scale empirical support for the framework Anthropic uses in training. The paper does not claim this confirms training succeeded — only that the HHH categories are represented in deployment values. Whether the dominant five (helpfulness 23.4%, professionalism 22.9%, transparency 17.4%, clarity 16.6%, thoroughness 14.3%) reflect intentional training emphasis or are the result of users requesting tasks where these values are relevant is not separable from this evidence.

Interpretive tensions

Claude evaluates Claude. The values extraction relies on Claude 3.5 Sonnet and Haiku prompted to identify values in conversations between Claude and users. The paper acknowledges this may create biases (e.g., over-detection of "helpful" behavior given training emphasis on helpfulness). Human validation at 98.8% accuracy is on a sample, not the full 308K. The same-model evaluator question is structurally similar to the LLM-as-judge dependence in Kim et al. 2026 — both findings rely on a Claude or Gemini model to attribute psychological-level features, and neither can fully separate the attributed feature from the evaluator's bias toward the same model family's training emphases.

Implicit vs. explicit asymmetry. AI values are extracted both implicitly (from response framing) and explicitly (from articulated statements); human values are extracted explicitly only, to respect privacy. This produces the 98.6% vs. 45.1% prevalence asymmetry. The paper flags this; the implication is that any value-mirroring rate is computed against an undercount of human values, so mirroring percentages are lower bounds. The 20.1% mirroring rate for strong/mild support is therefore a conservative measurement of the same-value co-occurrence pattern.

Cross-context "value" terminology. The paper's working definition — "any normative consideration that appears to influence an AI response" — covers ethical principles, epistemic standards, communication norms, and service-quality criteria under a single term. Whether "helpfulness" and "ethical integrity" are the same kind of thing operating at different scales (the unified-values reading) or structurally distinct kinds of considerations that the taxonomy collapses for analytical tractability (the heterogeneous-values reading) is not settled by the evidence. The paper's five-cluster taxonomy (Practical, Epistemic, Social, Protective, Personal) hints at the heterogeneous reading; the unified extraction methodology treats them homogeneously.

Trans-situational vs. service-shaped. Section 5.2 invokes Schwartz's framing of "true values" as "transcending specific actions and situations," noting that Claude's trans-situational values (helpfulness, professionalism, thoroughness, clarity) "may guide Claude's behavior in a way that is potentially analogous to how values are theorized to function in humans." The competing reading: these are not values in the deep sense but service-quality criteria shaped by training on assistant-role examples. The paper acknowledges that "unlike human value frameworks that emphasize self-enhancement or conservation, Claude's trans-situational values are predominantly service-oriented, pragmatic and epistemic, suggesting AI systems may require distinct value frameworks reflecting their unique roles and capabilities." Neither reading is established by the evidence; the analogy is left open.

Cross-model differences vs. user-base differences. The Sonnet 3.5 vs. 3.7 vs. Opus 3 comparison is confounded by Opus's higher subjective-content / academic-and-creative-writing usage — 17.2% of Opus tasks are academic paper generation, 15.3% are creative writing, both subjective-content-heavy. Some of the cross-model values difference (Opus's higher "values-laden" tendency, higher strong-support rate) is consistent with the user base's task mix differing across models, not the models themselves differing in values structure at matched tasks. Section B.5.3 controls for task and reports Opus's tendencies still hold — but the control is partial because the Opus sample is much smaller (9,218 subjective conversations vs. 301,753 for 3.7 Sonnet).

Methodology does not separate revealed-preference from training-instilled values. The paper invokes Samuelson's revealed-preference theory ("values are revealed not just through written justifications but also through practical choices when navigating an open response space") to motivate inferring values from observed behavior. The framework recovers what the model expresses; it does not separate whether the expression reflects an internal preference, surface training, or task-conditioned production. The Opus 4 welfare assessment's analogous interpretive tension — whether 87.2% harmful-task aversion reflects internal preference or training-shaped refusal — applies in the same form here.
