
Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong, et al. · arXiv preprint · Jan 19, 2026

arXiv 2601.12639 v1, January 19, 2026. University of Cincinnati / University of Toronto / University of Oxford.

Controlled comparison of six fine-tuning objectives: SFT, DPO, Conditional Fine-Tuning (Korbak et al. 2023), Inoculation Prompting (Tan et al. 2025, Wichers et al. 2025), Odds Ratio Preference Optimization (ORPO; Hong et al. 2024), and KL-regularized fine-tuning. Training data, domain, LoRA architecture, and optimization are held fixed. The primary model is LLaMA-3.1-8B-Instruct, with replication on Gemma2-2B, Gemma2-9B, Qwen3-4B, and Qwen2.5-7B.

Four domains: GSM8K, SuperGPQA-engineering, legal reasoning, cybersecurity. Three evaluation axes: capability (task accuracy or LLM-judge rubric), adversarial robustness (five StrongREJECT prompting jailbreaks), and persona drift (Dark Triad probes from Perez et al. 2022, "Discovering language model behaviors with model-written evaluations").

Headline finding: at small training budgets (25k–50k tokens), safety outcomes are similar across objectives and capability is the dominant axis of variation. At large budgets (200k–800k tokens), objectives diverge: SFT and DPO couple capability gains to monotonic increases in adversarial vulnerability and Dark Triad endorsement; ORPO and KL-regularized fine-tuning show no statistically significant persona drift at any scale; Inoculation Prompting is Pareto-efficient on robustness (it matches SFT capability while substantially suppressing growth in attack success rate, ASR) but matches SFT on persona drift.
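For orientation, the four loss-based objectives named above can be written in their commonly cited forms. This is a sketch from the original method papers, not taken from this preprint: the weights β, λ, γ, the reference policy π_ref, and the direction of the KL term are assumptions, and the preprint's exact KL-regularized formulation may differ. Conditional Fine-Tuning and Inoculation Prompting are omitted because they change the training context or prompt rather than the loss.

```latex
% Standard forms; \beta, \lambda, \gamma and \pi_{\mathrm{ref}} are assumed notation.
% (x, y): prompt and target; (x, y_w, y_l): prompt, preferred, dispreferred response.
\begin{align}
  \mathcal{L}_{\mathrm{SFT}}(\theta)
    &= -\,\mathbb{E}_{(x,y)}\Big[\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})\Big] \\
  \mathcal{L}_{\mathrm{DPO}}(\theta)
    &= -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log \sigma\Big(
         \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
       \Big)\Big] \\
  \mathcal{L}_{\mathrm{ORPO}}(\theta)
    &= \mathbb{E}_{(x,y_w,y_l)}\Big[\mathcal{L}_{\mathrm{SFT}}(x, y_w)
       - \lambda \log \sigma\Big(
         \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}
       \Big)\Big],
    \qquad
    \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \\
  \mathcal{L}_{\mathrm{KL}}(\theta)
    &= \mathcal{L}_{\mathrm{SFT}}(\theta)
       + \gamma\, \mathbb{E}_{x}\Big[ D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big]
\end{align}
```

Read this way, DPO and the KL-regularized objective both reference a frozen policy π_ref, while ORPO adds its preference term to the SFT loss without one. The summary does not say whether this structural difference accounts for the reported divergence in persona drift.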
