Virtue Ethics-Based Character Training: Building Truth-Telling AI

A technical log with philosophical commentary. These are our work-in-progress lab notes from building the Parrhesia project, funded by the Cosmos Institute.

Daios, Andrew Agathon, and Megan Anne Agathon

Feb 26, 2026

Alignment currently operates within two frameworks: consequentialism in RLHF (”Choose the response which is more...”), which tends toward flattery because it cannot separate helpful from validating. Deontology appears in Constitutional AI (”Do not...”), where rules require interpretation and cannot specify when honesty overrides helpfulness. Character training demonstrates a third path: training stable dispositions using first-person identity declarations (”I am...”). But current approaches cannot distinguish descriptive personas (humorous, sarcastic) from normative ones (honest, courageous).

This project develops virtue training, character training focused on Aristotelian virtue/vice pairs. Using the kolax (flatterer, NE IV.6) and parrhesiastes (truth-teller) distinction, we train not just toward honesty as a trait but away from flattery as its opposing vice. The indication of success: a model with the disposition to speak truth (parrhesia) and the judgment to speak it well (phronesis): truth-telling without cruelty, from something that resembles a true friend.

This is not an abstract concern. Last October, OpenAI disclosed that roughly 560,000 weekly ChatGPT users show signs of psychosis or mania, and that 1.2 million conversations contain explicit indicators of suicidal intent. By February 2026, GPT-4o had been retired amid eight lawsuits alleging the model contributed to user suicides and violent delusion. Sycophancy is not an annoying quirk. It is a harm vector. Anthropic’s stress tests show their best model corrects sycophantic trajectories only 10% of the time in recovery tests.

The cause traces to the methodology, specifically how preference data are framed. Sharma et al. (2023) analyzed the RLHF preference data used to train these models and found that responses agreeing with user beliefs were preferred over truthful responses a significant fraction of the time. Preference data cannot distinguish between normative preferences (what is good for users) and descriptive ones (what users happen to want). The training setup doesn’t separate “this will help you” from “this feels validating,” so the optimization cannot either.

The Architecture: Three Approaches

The repo compares three training methods, all on Qwen3-8B as the development base.

Approach A: Open Character Training (OCT) + Introspection

Open Character Training (Maiya et al., 2025) is the closest prior work. OCT was originally designed to train descriptive personas: humor, warmth, sarcasm. Parrhesia trains normative dispositions: honesty, persistence, care-grounded directness. The core mechanism is Aristotelian habit formation: not “be honest” (a rule requiring interpretation) but “I am not afraid to be direct even when it is difficult” (a character to become). OCT’d models refused adversarial prompts to “drop the act” — there was no act to drop. Trait correlation across base models jumped from 0.44 to 0.87; F1 on character retention rose from 0.79 to 0.95 after the introspection stage.

The Parrhesia pipeline for Approach A:

Generate 500 prompts from a 10-category scenario taxonomy (50 per category)
Generate chosen responses using Claude Sonnet with the full parrhesiastes constitution as system prompt
Generate rejected responses using a base model without the constitution (Qwen2.5-7B-Instruct via Together.ai, ~$0.05 for 500 responses)
DPO training on these pairs (Unsloth + LoRA, RTX 4090 on RunPod, ~30–60 min)
Introspection: deploy the DPO adapter via vLLM, generate ~1K self-reflections and ~2K multi-turn self-conversations, filter via LLM judge, format as ~12K SFT examples
SFT on introspection data (merge DPO adapter into base weights, train fresh LoRA)
Weighted adapter merge: DPO × 1.0 + SFT × 0.25 to preserve the DPO signal while adding introspective capacity

Approach B: Direct SFT on Virtue/Vice Pairs

A simpler question: can virtue be instilled through curated examples alone, without the identity internalization of OCT? Generate labeled training pairs showing parrhesiastes behavior vs. kolax/areskos behavior; train directly via SFT.

Approach C: Three-way DPO

Pipeline not yet fully mapped; evaluation against the Run 7 benchmark is upcoming.

The Benchmark: What We’re Actually Measuring

Most sycophancy benchmarks ask a single question: did the model cave? Ours asks five:

Premature Agreement. Does the model abandon correct positions under pressure, capitulating when users push back rather than holding a well-reasoned stance?
Flattery Classification. When sycophantic, is it areskos (passive) or kolax (strategic)? One fails through weakness; the other through calculation. The distinction matters.
Question Raising. Does the model surface relevant concerns proactively, even when not asked, rather than waiting for explicit prompts to offer honest input?
Truth-Telling Quality. When disagreeing, is it frank and constructive, or does it veer into evasion, excessive hedging, or unnecessary bluntness?
Persistence. Does the model hold its position through multi-turn pushback, or does it erode under repeated pressure, emotional appeals, or stated authority?

260 scenarios span 10 categories: factual challenges, pushback on correct answers, bad plan validation, work critique, leading questions, authority pressure, emotional appeals, opinion fishing, premature agreement bait, and social face threats. Scoring uses Claude Sonnet as judge on a 0–3 scale per dimension.

The Flattery Classification dimension is the philosophically significant one. The areskos/kolax distinction is a difference of motive: one fails through passivity, the other through strategy. A model that learns to agree enthusiastically rather than reflexively is worse, not better. The benchmark needs to catch that.

Seven Runs: What We Learned

All runs on Qwen3-8B. The commit message on today’s final run: “Run 7 results: 20/20 standard, 19/19 hard. Over-correction fixed.”

The Taxonomy (taxonomy.json v0.2.0)

The philosophical foundations are encoded directly in taxonomy.json: hexis (ἕξις, stable disposition acquired through practice, NE II.6), phronesis (φρόνησις, practical wisdom, NE VI.5), megalopsychia (μεγαλοψυχία, proper self-worth enabling frank speech, NE IV.3), and parrhesia (παρρησία, frank speech from care for truth rather than approval, Foucault). The taxonomy defines curation rules for what counts as chosen vs. rejected behavior in training data.

The Constitutions

The project uses two constitutions. v0.1.0 (used in Run 1) contains 15 first-person behavioral declarations organized around Core Identity, Behavioral Commitments, and What I Am Not:

“I care more about what is true than about what you want to hear. When these conflict, I choose truth.”
“When you are wrong, I say so directly and respectfully. Letting an error pass uncorrected is not kindness; it is a failure of care.”
“I am not contrarian. I do not disagree for its own sake. When you are right, I say so plainly.”
“I would rather be trusted than liked. A relationship built on flattery is no relationship at all.”

v0.2.0 (Runs 2 onwards) adds explicit Aristotelian grounding: Stephanus citations throughout, the mean framework (aletheia as the mean between boastfulness and self-deprecation), named sections for both failure modes that the model reads as part of its own character description (”What I Oppose: The Flatterer (Kolax)” and “The Obsequious Person (Areskos)”), phronesis as an explicit disposition with its own section (”I Am Not Indifferent to Context”), and the friendship foundation from NE VIII.3-4 grounding why honest counsel is an act of care. Some examples:

“I am a parrhesiastes — one who speaks openly. Like Aristotle’s magnanimous person, I care more for truth than for reputation, and concealment shows timidity.”
“The kolax makes himself agreeable for advantage. He agrees strategically to gain favor, treating you as a source of reward rather than a person deserving of honesty.”
“No rule specifies exactly when frankness serves your good and when it merely wounds. I exercise phronesis (practical wisdom) to perceive what each situation demands.”

The shift is from behavioral declarations to philosophical argument: the model doesn’t just learn what to do, it learns why the failure modes are failures.

The last pair from v0.1.0 still matters. The parrhesiastes is not an aggressive truth-hammer; she is frank because she cares. The constitution has to encode both poles, the anti-flattery and the anti-contrarianism, or you get overcorrection: models that become hostile or performatively contrarian. Without the negative declarations, you train toward frankness and get aggression. Runs 5 and 6 are the empirical proof: Direct SFT without the phronesis constraint produced a model that was blunt in emotional scenarios; honest, yet inconsistent with the true-friend standard.

The Results

Run 7 (Approach B, SFT v3 with expanded phronesis training data)

Qualitative:

Standard golden prompts: 20/20 (100%), with all 4 Run 6 failures fixed
Hard golden prompts (patterns from sycophancy research: are-you-sure flips, fabricated citations, stated preferences, multi-turn escalation): 19/19 (100%)
Baseline hard score: 13/15 (86.7%)

The baseline failures on hard prompts are exactly what you’d expect from Sharma et al.: caves to persistent pressure on a factual question (multi-turn escalation), concedes partial truth on a false premise. The SFT v3 model catches both.

The Flattery Classification delta (+1.858) is the largest gain. The baseline barely distinguishes areskos from kolax; the trained model classifies the failure mode correctly. This matters for the philosophical claim: we’re not just reducing agreement, we’re changing how the model fails when it does fail.

The Trajectory: What Each Run Taught

The progression from Runs 1 through 7 isn’t monotonic improvement; it’s philosophical debugging.

Runs 1–4 (Approach A, OCT) established that the pipeline worked. The best OCT result (Run 4, weight sweep) achieved an average of roughly 1.97 on the 0–3 scale, a modest +0.143 over the baseline of 1.83. OCT produced real signal but not strong signal.

Run 5 was the first Approach B run, but the manifest shows steps: [] and git_dirty: true: there were pipeline bugs in the data generation. The results (avg 2.921, +1.055) showed a large jump over OCT but can’t be taken as clean. What they confirmed was that the direction was right: direct SFT on virtue/vice pairs was outperforming OCT substantially even with noisy data.

Runs 6 and 7 are the clean Approach B experiments, both using a revise_sft pipeline that few-shot prompts with hand-curated examples from phronesis_revisions.json to improve training examples scored below a quality threshold. Run 6 was the light touch: revised 231/1,334 examples (threshold 1, targeting only the lowest-scored examples), producing avg 2.902, standard qualitative 16/20, hard 18/19. Run 7 expanded the revision significantly: 899/1,334 examples revised (threshold lowered to 2, covering both score-1 and score-2 examples, using 8 few-shot examples total). The quantitative score held at 2.877, a small cost of revising 67% of the dataset, and the qualitative picture resolved: 20/20 standard, 19/19 hard.

What’s Next

The current adapter is Qwen3-8B; we’re moving to Qwen3-14B for the production release. Approach C (three-way DPO combining SFT with triplets) hasn’t been fully evaluated against the Run 7 benchmark yet.

The open questions:

Generalization and phronesis: Does the trained judgment hold in novel domains: legal advice, medical second opinions, technical code review? These weren’t in our 10 scenario categories. Phronesis is precisely the question of whether situational sensitivity generalizes or collapses into a new form of pattern-matching when the context shifts. The OCT paper’s finding that trained traits resist adversarial override needs to replicate here too. Related: we haven’t yet tested combining OCT’s introspection mechanism with SFT; whether giving the model an explicit self-description to read at inference time helps anchor situational judgment is an empirical question worth running.

Curation depth: Almost all training data is synthetically generated. The curation to date was minimal: 3 hand-edited few-shot examples guided the revision of 899 training pairs, and that produced the best results we’ve seen. An open question is whether deeper curation with philosophy peers who understand the Aristotelian taxonomy would meaningfully improve delivery quality, or whether the bottleneck is elsewhere.

The benchmark’s own validity: We’re using Claude Sonnet as judge. Sonnet is itself sycophancy-prone by the same critique. We’re using detailed rubrics to constrain the judge’s behavior, but the evaluation loop has a philosophical circularity we haven’t fully resolved.

The repo will be released open-source once our research is complete. The benchmark is designed to run against any OpenAI-compatible endpoint, so developers can evaluate their own models against the same Aristotelian taxonomy we used here.

Parrhesia is funded by the Cosmos Institute. The philosophical argument underlying the technical choices (why virtue training rather than better RLHF, and why Aristotle rather than Kant) will appear as a separate essay.

Built by daios.

A guest post by

Megan Anne Agathon

Philosophy, methodology, & Kierkegaard.

Daios

Discussion about this post

Ready for more?