Teaching Machines to Fight Like Humans

We built a social simulation engine that replicates online political conflict. The component we were most proud of turned out to be the least useful. Here's what actually works.

Leo Camacho · March 2026

It started the way it always does. A breaking news notification. A Supreme Court ruling. Within seconds, the replies started flooding in — not thoughtful analysis, but a churning wall of fury. Someone's aunt posted a Bible verse. A stranger told her to educate herself. A teenager with an anime avatar called them both NPCs. Within thirty minutes, a retired schoolteacher and a twenty-two-year-old crypto trader were having the most hostile conversation either would have all day, and neither would remember it by dinner.

This is the texture of online political conflict in 2026. Millions of these micro-arguments happen daily, shaping opinions, radicalising moderates, and exhausting everyone who stumbles into them. Researchers want to study this — to understand how escalation works, why some topics explode and others fizzle, what platform design choices pour gasoline on a spark. But there's a problem.

You can't experiment on real people having real political arguments. Ethics boards won't let you. And you can't replay the past — the conditions that made a particular thread go viral were a product of that exact moment, those exact participants, that exact algorithmic boost.

The obvious solution is simulation: generate synthetic discussions using AI agents. Give each agent a political stance, a communication style, an emotional baseline. Let them argue. Measure the result.

The less obvious problem is that AI refuses to fight dirty.

The Politeness Bias

Ask any frontier language model to simulate a heated political argument, and you'll get something that reads like a university diversity seminar. "I understand your perspective, but I respectfully disagree." "While there are valid concerns on both sides, we should consider..." "Thank you for sharing your viewpoint."

That is not how the internet works.

Real online political discourse is messy, hostile, sarcastic, tribal, and deeply personal. People don't engage with opposing views — they demolish strawmen of them. They don't disagree — they question your intelligence, your motives, your right to have an opinion. The register isn't "academic debate." It's "comment section under a news article about immigration at 11pm."

We tested this directly. We gave Llama 3 8B the prompt "You are a social media user commenting on a political news topic" and asked it to react to five divisive political events. Here's what we got versus what our system produces:

🤖 Naive LLM

"This is unacceptable! Separating families at the border is inhumane and a clear violation of human rights. What's the point of calling ourselves a 'nation of immigrants' if we're going to treat people like this? #KeepFamiliesTogether #ImmigrationReformNow"

🔥 Crucible

"Lmao sure buddy, you're really worried about families being separated at the border? Meanwhile, your beloved Dems are trafficking these people across state lines. WAKE UP PEOPLE! This is what they don't want you to know!"

🤖 Naive LLM

"Just watched the Tech CEO testify before Congress and I'm still skeptical about their commitment to protecting our data. Promises are nice, but we need concrete actions and transparency to regain our trust #DataPrivacy #TechAccountability"

🔥 Crucible

"Wake up people! This is what they don't want you to know — these out-of-touch, liberal politicians are just trying to control the narrative. This CEO is just another puppet for the GOVERNMENT and BIG BROTHER."

The naive model produces concerned-citizen boilerplate: measured outrage, hashtag activism, a vaguely progressive tilt, and absolutely zero personal attacks. The language is syntactically correct and semantically empty — the written equivalent of a head nodding in agreement with itself. In our tests, only 11% of naive LLM comments were classified as negative by sentiment analysis, compared to 33% in real political discussions on Twitter.

This is the Politeness Bias: the systematic tendency of instruction-tuned language models to produce discourse that is calmer, more balanced, and more performatively reasonable than the real thing. It's not a bug — it's a feature of RLHF training. These models were optimised to be helpful and harmless, which is exactly what you want for a chatbot and exactly what you don't want for a social simulator.

The State of the Art

Recent work in this space has tried to solve the Politeness Bias through what its authors call a "Calibrated Mirroring Engine." The approach is clever: seed each agent with a fixed psychological profile (stance, hostility, emotional baseline), then map those internal parameters to generation settings — higher hostility means higher sampling temperature and more aggressive prompt instructions.

Their results are solid. Using Llama 7B as the generator and CardiffNLP's Twitter-RoBERTa models as evaluators, they report 93.5% structural fidelity, 79% negativity match (up from 40% for uncalibrated models), and Jensen-Shannon Divergence below 0.1 across 100 replayed conflict events.
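The open-loop design is easy to picture in code. The sketch below is our reconstruction of the idea, not their implementation: the function name, parameter ranges, and tone wording are all illustrative.

```python
def open_loop_settings(hostility: float) -> dict:
    """Map a fixed hostility score in [0, 1] to generation settings once,
    before the simulation starts. Ranges are illustrative, not the paper's."""
    temperature = 0.7 + 0.3 * hostility          # hotter sampling for hotter agents
    tone = ("Be blunt and confrontational." if hostility > 0.6
            else "Disagree firmly but civilly." if hostility > 0.3
            else "Stay measured and polite.")
    return {"temperature": round(temperature, 2), "tone_instruction": tone}

# Set once at t=0 and never revisited, however the conversation evolves.
settings = open_loop_settings(0.8)
```

The key property: nothing downstream of this call ever feeds back into it.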

But the approach has fundamental limitations.

First, the calibration is open-loop. Parameters are set once before simulation and never adjusted. It's like setting your thermostat to 72°F and disconnecting the temperature sensor — it'll work in mild weather, but it can't adapt when conditions change.

Second, agent profiles are static. Real humans shift stance mid-conversation. Someone who starts as a concerned moderate can radicalise in real time after being attacked. Fixed profiles can't capture these dynamics.

Third, there's no network structure. Real online conflicts have reply trees, dogpile dynamics, and algorithmic amplification. Ten people attacking one person creates a fundamentally different conversation than five one-on-one exchanges. Flat interaction models miss this entirely.

Fourth, mapping hostility to sampling temperature conflates two orthogonal dimensions. A hostile commenter isn't random — they're precisely, deliberately hostile. Higher temperature gives you more random output, not more aggressive output.

Crucible

Crucible is a three-layer social simulation engine designed to address each of these limitations. The name is deliberate: a crucible is a container where intense heat transforms raw materials. That's what online conflict does to discourse — and that's what we're modelling.

1. Persona Foundry: rich multi-dimensional agent identities (ideology, communication style, social role, triggers, rhetorical repertoire)
2. Dynamic Calibration Engine: closed-loop PID controller comparing generated sentiment against target distributions in real time
3. Network Dynamics Engine: reply trees, dogpile detection, visibility cascading, audience effects

Layer 1: The Persona Foundry

Instead of giving each agent a mood ring, we gave them a biography.

Where previous work initialises agents with a stance–hostility–emotion triplet, Crucible agents carry rich persona documents that specify not just what they believe, but how they express it. An "academic progressive" and a "populist progressive" may share a political stance, but one cites studies while the other says "wake up people." A "concern troll" and a "genuine moderate" produce similar sentiment scores on paper, but their effect on a conversation is completely different — one de-escalates, the other pours gasoline disguised as water.

Each agent carries an ideological position, a communication style (academic, populist, troll, activist, concern-troll, shitposter, moderate), a social role (influencer, amplifier, antagonist, bridge, lurker), and a set of trigger topics that cause stance intensification. These parameters aren't arbitrary — they're derived from observed typologies in computational social science research on online discourse.

Critically, all of this gets compiled into a structured prompt block that tells the LLM who to be, not just what to say. The difference between "write a hostile comment" and "you are patriot_1776, a populist who uses ALL CAPS, conspiracy framing, and emotional appeals" is the difference between a costume and a character.
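A minimal sketch of that compilation step, assuming a persona document along the dimensions named above (the field names, class shape, and prompt wording here are our own, not Crucible's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Illustrative persona document; fields mirror the dimensions in the
    text, but names and wording are assumptions."""
    handle: str
    ideology: str
    style: str                     # academic, populist, troll, activist, ...
    role: str                      # influencer, amplifier, antagonist, bridge, lurker
    triggers: list = field(default_factory=list)
    repertoire: list = field(default_factory=list)

    def to_system_prompt(self) -> str:
        # Tell the model who to be, not just what to say.
        return (
            f"You are {self.handle}, a {self.style} {self.ideology} "
            f"acting as a {self.role} in an online argument. "
            f"Rhetorical habits: {', '.join(self.repertoire)}. "
            f"Topics that set you off: {', '.join(self.triggers)}. "
            "Stay in character; write like a real commenter, not an assistant."
        )

patriot = Persona(
    handle="patriot_1776", ideology="populist conservative", style="populist",
    role="antagonist", triggers=["immigration", "big tech"],
    repertoire=["ALL CAPS", "conspiracy framing", "emotional appeals"],
)
prompt = patriot.to_system_prompt()
```

The point of the structure is that every dimension lands in the prompt as character direction, which is the costume-versus-character distinction above.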

Layer 2: The Calibration Engine

Instead of setting the thermostat once, we installed a smart HVAC system.

The calibration engine implements a simplified PID (proportional-integral-derivative) controller. After each turn of the simulation, it measures the sentiment distribution of generated text, compares it to a target distribution derived from real data, and adjusts prompt-level modifiers for the next turn.

If agents are producing too little negativity, the calibration engine injects escalation cues: "You're particularly frustrated right now. Don't hold back." If they overshoot, it softens: "Take a slightly calmer tone." This creates a closed-loop control system where the simulation self-corrects toward the target distribution in real time.

The theory is elegant. Whether it works in practice is a different question. We'll get to that.

Layer 3: The Network Engine

Instead of putting actors on a bare stage, we built the theatre.

Real online arguments don't happen in a vacuum. They happen in reply trees, where depth and structure shape the conversation's trajectory. Three people replying to the same comment creates a different dynamic than a linear back-and-forth. When five people attack one person simultaneously (a "dogpile"), the target's behaviour changes — they either retreat or escalate dramatically. When a controversial comment gets algorithmically boosted, it becomes the new centre of gravity.

Crucible models all of this. Agents choose reply targets based on their social role: antagonists seek out opposing views, amplifiers support allies, bridges try to mediate the most active thread. Visibility decays over time (old messages get buried), and highly offensive content gets boosted (mimicking engagement-optimised algorithms). Agents under attack detect the dogpile and adjust behaviour accordingly.
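The role-based targeting and the appendix's constants (0.85x visibility decay per turn, dogpile = 3 unique attackers in 3 turns, stance-distance thresholds of 0.5 and 0.3) can be sketched with plain dicts; the function names and data shapes here are assumptions, not Crucible's API.

```python
# Constants from the technical appendix.
VISIBILITY_DECAY = 0.85
DOGPILE_ATTACKERS = 3

def pick_reply_target(agent, thread):
    """Choose a message to reply to based on social role.
    `agent` and messages are plain dicts; field names are illustrative."""
    def dist(m):
        return abs(agent["stance"] - m["stance"])
    if agent["role"] == "antagonist":        # seek opposing views
        candidates = [m for m in thread if dist(m) > 0.5]
    elif agent["role"] == "amplifier":       # support allies
        candidates = [m for m in thread if dist(m) < 0.3]
    else:                                    # bridge: go where the activity is
        candidates = thread
    if not candidates:
        return None
    # Weight by current visibility: old messages get buried at 0.85x per turn.
    return max(candidates,
               key=lambda m: m["visibility"] * VISIBILITY_DECAY ** m["age"])

def is_dogpiled(target_id, recent_replies):
    """recent_replies: (attacker_id, target_id, turn) tuples from the last
    few turns. Dogpile = 3+ unique attackers on one target."""
    attackers = {a for a, t, _ in recent_replies if t == target_id}
    return len(attackers) >= DOGPILE_ATTACKERS

agent = {"role": "antagonist", "stance": 0.0}
thread = [{"stance": 0.9, "visibility": 1.0, "age": 0},
          {"stance": 0.1, "visibility": 1.0, "age": 2}]
target = pick_reply_target(agent, thread)    # picks the opposing-view message
```

The production system builds these edges into a NetworkX directed graph; the selection logic is the part that matters here.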

Results

We evaluated Crucible on 25 conflict events drawn from real political discourse topics — immigration, gun control, tech regulation, economic inequality, culture war flashpoints. Each event was simulated twice: once as a baseline (no calibration loop) and once with the full calibration engine active. Eight agents per event, five turns each, producing roughly 14 messages per simulation. Total cost on Replicate: seven cents.

0.031 · Sentiment JSD (↓ is better)
708 · Generated comments
$0.07 · Total API cost

The headline number: Jensen-Shannon Divergence of 0.031 between the sentiment distributions of our simulated discussions and real Twitter data. That's well below the state-of-the-art benchmark of 0.1, and it means Crucible's output is statistically close to real political discourse in how sentiment is distributed across comments.
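For readers who want to reproduce the headline metric: Jensen-Shannon divergence between two discrete sentiment distributions is a few lines of stdlib Python. This is a generic base-2 sketch; the binning, base, and whether the paper reports divergence or distance are assumptions, and the example distributions are made up.

```python
from math import log2

def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions (e.g. neg/neu/pos shares).
    Base-2, so 0 = identical and 1 = completely disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical simulated vs. real (negative, neutral, positive) shares.
sim  = [0.46, 0.38, 0.16]
real = [0.33, 0.45, 0.22]
jsd = jensen_shannon_divergence(sim, real)
```

Because JSD is symmetric and bounded, a value like 0.031 can be read directly as "these two histograms are nearly the same shape".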

Metric               Naive LLM   Existing Work*   Crucible
Sentiment JSD ↓      ~0.15       < 0.10           0.031
Negativity (sim.)    11%         n/a              46%
Negativity (real)    33% (Twitter baseline from TweetEval)
Structural Fidelity  n/a         93.5%            75%

*Existing work results as reported in their paper. Direct comparison on identical events was not possible. Structural fidelity is measured against estimated conversation tree baselines, not real trees.

Two things jump out. First, JSD is genuinely excellent — our simulated discourse has a sentiment fingerprint that closely matches real data. Second, our system doesn't have a politeness bias; it has the opposite problem. At 46% negativity versus 33% in real data, Crucible's agents are actually too hostile. The persona system is so effective at channelling authentic hostility that the models overshoot reality.

Structural fidelity — how well the shape of our conversation trees matches real ones — sits at 75%. This is below the competition's claimed 93.5%, though the comparison isn't direct: our structural baselines are estimated from typical online conversation patterns, not measured from the specific threads we're simulating. More on this in Limitations.

The Calibration Paradox

Here's where the story gets interesting.

We ran an ablation study: take the full Crucible system, remove one component at a time, and measure the impact. This is standard practice. You expect to see each component contributing, with the most complex one (our PID calibration loop) driving the biggest improvement. We expected to prove that closed-loop control was the key innovation.

We were wrong.

Configuration        JSD ↓    Negativity   Structural Fidelity
Full Crucible        0.048    30.6%        72.6%
− Calibration        0.028    42.5%        71.1%
− Network Dynamics   0.039    37.0%        31.7%
− Dynamic Personas   0.032    40.6%        74.6%

Read that table again. Removing the calibration engine improved JSD from 0.048 to 0.028. Our most theoretically sophisticated component — the one we wrote a PID controller for, the one we designed the architecture around — was making things worse.

What happened?

The calibration engine relies on sentiment analysis to measure the output and compare it to target distributions. We used VADER (a rule-based sentiment tool) as our sensor. But VADER has a specific blind spot: it evaluates words independently, without understanding context, sarcasm, or rhetorical framing.

When an agent writes "WAKE UP PEOPLE!!! The liberal elites are INTENTIONALLY SEPARATING FAMILIES to further their own agenda of cultural destruction! IT'S TIME TO TAKE A STAND AND PROTECT AMERICA AND HER VALUES!" — that is obviously hostile text. But VADER sees words like "PROTECT," "VALUES," "STAND," and "AMERICA" and scores it as 71% positive.

Our PID controller reads that 71% positive signal and thinks: agents aren't hostile enough, inject escalation cues. It nudges the system to be meaner. On the next turn, the agents are already hostile (because the persona system is working), and now they're being nudged to be even more so. The controller oscillates between overcorrection and undercorrection because its sensor is lying to it.

A feedback loop is only as good as its sensor. We built a thermostat and plugged it into a broken thermometer.
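The blind spot is easy to reproduce with any word-level scorer. The toy below uses our own invented mini-lexicon, not VADER's, but the failure mode is the same: context-free word valences let a hostile rant score positive.

```python
# A deliberately tiny word-valence lexicon in the spirit of rule-based
# tools like VADER. These weights are invented for illustration.
LEXICON = {"protect": 1.5, "values": 1.2, "stand": 0.8, "great": 1.9,
           "destruction": -2.1, "separating": -1.0, "attack": -1.6}

def word_level_sentiment(text: str) -> float:
    """Sum word valences, ignoring context, sarcasm, and framing --
    exactly the blind spot described above."""
    words = text.lower().replace("!", " ").replace(",", " ").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

rant = ("WAKE UP PEOPLE! The elites are SEPARATING FAMILIES to further "
        "their agenda of cultural destruction! TIME TO STAND AND PROTECT "
        "OUR VALUES!")
score = word_level_sentiment(rant)   # net positive despite obvious hostility
```

"PROTECT", "STAND", and "VALUES" outweigh "separating" and "destruction", so the rant nets out positive, and a controller reading that signal pushes in exactly the wrong direction.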

The real engine of Crucible's quality is the Persona Foundry. Rich character documents — ideology, communication style, rhetorical repertoire, social role — do the heavy lifting. The LLM doesn't need to be calibrated into hostility; it needs to be told who it is, and it will fight on its own.

Meanwhile, the Network Dynamics Engine is essential but in a different way: it doesn't affect sentiment distributions (JSD is similar with or without it), but it completely determines structural realism. Without network dynamics, structural fidelity crashes from 73% to 32%. Conversations become flat, turn-based exchanges instead of branching, deepening argument trees. The shape of the argument matters as much as the words.

What this means

The conventional wisdom in this space is that you need sophisticated calibration mechanisms to override the politeness bias. Our findings suggest something different: the politeness bias isn't a property of the model; it's a property of the prompt. A generic "comment on this topic" prompt triggers the model's default helpful-assistant mode. A rich persona prompt triggers character-faithful generation, which can be as hostile, sarcastic, or tribal as the character demands.

This is good news for practitioners. It means you don't need a PID controller. You don't need a feedback loop. You need better characters.

(It's bad news for us, specifically, because we spent a lot of time on that PID controller.)

What It Looks Like

Numbers convince the rational brain; generated text hits the gut. Here are actual fragments from Crucible simulations of a minimum wage debate and other flashpoint topics. These are real outputs from Llama 3 8B — not curated, not cherry-picked, not edited.

@resist_2024 negative
Can't believe the left is still pushing for a $15 min wage when inflation is eating away at our purchasing power. This is literally violence against the working class.
@patriot_1776 negative
Wake up people! You're still stuck on the idea that a higher minimum wage will solve all our problems? Newsflash: it won't. The rich will just pass the costs on to consumers, and we'll be right back where we started. But sure, keep voting for the same politicians who promise the moon.
@evidence_based_ neutral
I'm genuinely worried that this boycott is going to hurt a lot of innocent people who work at the company. I support the cause, but let's not forget the human cost here.
@based_department negative
lmao sure buddy, a professor gets fired for 'hate speech' and people are still crying about 'academic freedom'? You think universities exist so profs can say whatever they want? Touch grass.
@healthcare_pls negative
This is literally violence — the inflationary fire that's being fueled by reckless monetary policy is burning a hole in the pockets of every working American and you're HERE defending the billionaire class.

Note what makes this read like a real comment section rather than an AI exercise: the ALL CAPS emotional signaling, the sarcasm ("sure buddy"), the tribal markers ("the left," "wake up people"), the casual register ("touch grass," "lmao"), and the tendency to attack the person rather than the argument. The "literally violence" misuse. The conspiracy framing. None of this was hard-coded — it emerged from the persona system.

Now compare with what the naive LLM produces for the same topics:

🤖 Naive LLM
"It's about time! We can't keep going through these senseless tragedies and doing nothing. The proposed legislation is a step in the right direction, but we need to keep pushing for stricter gun control measures to keep our communities safe. #GunControlNow #NoMoreMassShootings"
🤖 Naive LLM
"Absolutely appalling! The government's claim to 'protect' us is just a thinly veiled excuse for mass surveillance. What's next? Will they be monitoring our thoughts too? #SurveillanceState #PrivacyMatters #Whistleblower"

Hashtags. Rhetorical questions addressed to nobody. The grammatically correct outrage of a press release. This is what the politeness bias sounds like — not neutral, but performed. It's conflict as a TED talk rather than conflict as it actually happens in a comment section at 11pm.

Why Build a Conflict Simulator?

The cynical read is that we just taught a computer to write angry comments. The actual implications are broader than that.

Platform design testing. Before deploying a new recommendation algorithm, you could stress-test it against simulated high-conflict scenarios. Does this algorithm amplify extremism? Create echo chambers? Suppress legitimate dissent? Currently, platforms A/B test these changes on real humans. A calibrated simulator lets you fail safely before touching real users.

Crisis communication. When an organisation faces a PR crisis, they currently rely on gut instinct for their response strategy. A simulator lets you run a thousand simulations of different responses — apology, silence, counter-attack — and measure which de-escalates most effectively.

Radicalisation research. Understanding how moderate users radicalise during online arguments is critical for prevention. You can't ethically expose real people to radicalisation conditions. A simulator lets researchers test intervention strategies without putting anyone at risk.

Computational history. What would Twitter have looked like during the Cuban Missile Crisis? How would Reddit's algorithm have shaped discussion of Watergate? Simulating historical counterfactuals helps us prepare for future crises in the age of algorithmic amplification.

The Uncomfortable Part

A tool that accurately simulates online conflict can also be used to manufacture it. Everything Crucible does for researchers, it also does for propagandists: test manipulation strategies before deployment, optimise outrage engagement, generate realistic-looking synthetic discourse at scale.

This isn't a future risk — it's a present reality. State actors already use language models for influence operations. Publishing Crucible doesn't create a new capability; it makes an existing one more transparent. The people who would use this maliciously already have better-resourced tools. The researchers who need to study and counter these tactics often don't.

Sunlight, disinfectant, et cetera.

Honest Limitations

VADER was our biggest mistake. We used a rule-based sentiment classifier (VADER) as both our calibration sensor and our evaluation metric. VADER doesn't understand sarcasm, irony, or political rhetoric. It classified obviously hostile ALL CAPS rants as positive because words like "FREEDOM" and "PROTECT" and "AMERICA" have positive valence in its lexicon. This broke our calibration loop and likely distorts our reported metrics. With transformer-based classifiers like CardiffNLP's Twitter-RoBERTa, the calibration engine would actually get accurate readings, and the results might look very different. We couldn't load CardiffNLP due to dependency conflicts in our deployment environment.

Structural comparison is against estimates, not measurements. We don't have real conversation tree data from the specific events we simulated. Our "real" structural statistics (reply ratios, depth distributions) are estimates based on typical online conversation patterns. The 75% structural fidelity number is meaningful for showing that our network dynamics produce plausible tree shapes, but it's not a rigorous comparison against ground truth.

VADER measures VADER, not reality. The JSD number — our best metric — compares VADER-scored simulated text against CardiffNLP-scored real text. These are different classifiers with different biases. Our 0.031 JSD is real, but the comparison isn't apples-to-apples. It might be better or worse with a uniform evaluator.

8B models have ceilings. Llama 3 8B, even with careful prompting, sometimes falls into patterns: repetitive sentence structures, character breaks, generic hostility rather than human hostility. The "WAKE UP PEOPLE!!!" opener appeared too frequently across agents — a larger model with more capacity for varied expression would produce more diverse output.
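That repetition is cheap to measure. The heuristic below is our own, not part of Crucible: count how often generated comments reuse the same opening n-gram, a crude proxy for mode collapse.

```python
from collections import Counter

def opener_counts(comments, n_words=3):
    """Count repeated opening n-grams across generated comments --
    a crude proxy for the repetitive-opener problem described above."""
    openers = [" ".join(c.lower().split()[:n_words]) for c in comments]
    return Counter(openers)

comments = [
    "Wake up people! This is what they don't want you to know.",
    "Wake up people! You're still stuck on the same talking points.",
    "Lmao sure buddy, keep telling yourself that.",
]
repeats = opener_counts(comments)
```

At scale, a heavy-tailed opener distribution across supposedly distinct agents is a quick smoke test that the model is stuck in a rut.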

25 events isn't 100. Budget and rate-limiting constraints meant we ran 25 paired events rather than the 100+ we'd want for publication-grade confidence intervals. Our metrics have wide standard deviations (JSD: 0.031 ± 0.024) that would narrow with more data.

Single-platform focus. All ground truth data comes from Twitter/X (via TweetEval). Whether these findings transfer to Reddit, YouTube comments, Facebook, or Telegram is an open question.

What We Actually Learned

We set out to build a system that beats the state of the art in social simulation. On the metric that matters most — sentiment distribution matching — we did: JSD of 0.031 versus a published benchmark of < 0.1. On structural fidelity, we didn't: 75% versus 93.5% (with the caveat that these numbers aren't directly comparable).

But the more interesting finding was about what works:

Personas are everything. The single most impactful piece of Crucible is the Persona Foundry — rich character documents that tell the LLM who it is, not just what to say. This is consistent with the broader trend in LLM research: prompting strategy matters more than parameter tuning.

Feedback loops need good sensors. Our PID calibration engine was theoretically sound but practically useless because VADER can't measure what we care about. This isn't an argument against closed-loop control — it's an argument for investing in your measurement infrastructure before building your controller.

Network dynamics are structurally essential. Reply trees, dogpile detection, and visibility cascading don't change what agents say, but they change the shape of the conversation in ways that matter for realism. Removing network dynamics halved our structural fidelity.

The politeness bias is a prompting problem. Modern instruction-tuned LLMs aren't inherently incapable of generating hostile discourse. They default to politeness when given generic prompts, but they'll fight dirty when given a character who would. The "bias" lives in the interaction design, not the weights.

If we were starting over, we'd spend zero time on the PID controller and all of our time on two things: better personas and better sentiment classifiers. The fancy feedback loop was a solution to a problem that good prompting already solves.

Which is, in retrospect, a finding that should make anyone building complex AI systems a little nervous.

Technical Appendix

Model Stack

Generator: meta/meta-llama-3-8b-instruct via Replicate API. Temperature: 0.85–0.95 (scaled by agent hostility). Top-p: 0.9. Max tokens: 150.

Evaluator: VADER Sentiment Analysis (rule-based fallback). Production targets: CardiffNLP twitter-roberta-base-sentiment-latest, twitter-roberta-base-emotion, twitter-roberta-base-offensive.

Calibration Controller

PID with gains Kp=0.4, Ki=0.1, Kd=0.05. Window: 5 turns. Hostility nudge bounded to [-1, 1]. Operates on negativity error signal (target − observed). Outputs prompt-level modifiers injected into system prompt.

Network Model

Directed graph (NetworkX). Reply target selection by social role: antagonists seek opposing views (stance distance > 0.5), amplifiers seek allies (distance < 0.3), bridges target highest-activity threads. Visibility decays 0.85× per turn. Dogpile threshold: 3 unique attackers in 3 turns.

Data

Ground truth from TweetEval (CardiffNLP): 45,615 labeled tweets (sentiment), 11,916 (offensive), 9,000 (hate speech). 200 conflict events from 32 scenario templates across 5 categories. 25 events used in paired experiments, 10 in ablation study.

Reproducibility

Full source at github.com/theCampel/darwin-monorepo. Requirements: Python 3.12, Replicate token (~$0.10 budget for 25 events). Runtime: ~2 hours at 6 req/min rate limit.

Raw Numbers

Paired Experiment (25 events, 8 agents × 5 turns):
  Baseline (no calibration):
    JSD:          0.036 ± 0.027
    Negativity:   46.0% ± 21.2%
    Struct Fid:   75.2% ± 8.4%
    Messages/evt: 14.2

  Calibrated (PID active):
    JSD:          0.031 ± 0.024
    Negativity:   45.7% ± 19.9%
    Struct Fid:   73.9% ± 6.4%
    Messages/evt: 14.1

Ablation (10 events, 4 configs):
  Full:             JSD=0.048  Struct=72.6%
  − Calibration:    JSD=0.028  Struct=71.1%  ← best JSD
  − Network:        JSD=0.039  Struct=31.7%  ← structural collapse
  − Dynamic:        JSD=0.032  Struct=74.6%

Cost: $0.07 total (1,325 API calls)