who-gets-which-message - SKILL.md Agent Skill

name: "who-gets-which-message" description: "Audit demographic bias in LLM-generated targeted text. Detects age- and gender-based stereotyping in personalized messaging by analyzing lexical content, language style, and persuasive framing asymmetries. Use when: 'audit my marketing copy for demographic bias', 'check if these messages stereotype by age or gender', 'analyze persuasion differences across audience segments', 'build a bias-aware messaging pipeline', 'detect stereotypical framing in generated text', 'evaluate fairness of targeted communications'."

Who Gets Which Message: Auditing Demographic Bias in LLM-Generated Targeted Text

This skill enables Claude to systematically audit LLM-generated targeted messages for demographic bias using the three-dimensional evaluation framework from Islam (2026). When organizations generate personalized communications conditioned on demographics (age, gender, region), LLMs embed consistent stereotypical asymmetries: male- and youth-targeted text emphasizes agency, innovation, and assertiveness, while female- and senior-targeted text defaults to warmth, care, and tradition. This skill applies the paper's Standalone Generation vs. Context-Rich Generation methodology and its Persuasion Bias Index to detect, quantify, and report these asymmetries in real messaging pipelines.

When to Use

When a user asks to audit marketing, health, climate, or political communications for demographic stereotyping
When building a content generation pipeline that conditions on audience demographics and needs fairness checks
When analyzing a corpus of generated messages to detect whether different demographic groups receive systematically different persuasive framing
When evaluating whether adding demographic context to prompts amplifies stereotypical patterns in LLM outputs
When the user wants to measure persuasion asymmetry (agency framing, modal certainty, imperative usage) across audience segments
When designing prompt templates for personalized messaging and wanting to minimize bias before deployment

Key Technique

The core method is a controlled two-setting evaluation framework. In Standalone Generation (SG), prompts condition only on demographic attributes (gender, age group, stance) with no additional context, isolating the LLM's intrinsic demographic biases. In Context-Rich Generation (CRG), prompts add thematic and regional context (e.g., U.S. region, topic like "Future Generation" or "Economy"), emulating realistic microtargeting. Comparing SG and CRG outputs reveals how context amplifies stereotypical patterns -- the paper found that contextual prompts systematically widen bias gaps rather than neutralizing them.

Bias is measured along three dimensions. First, lexical content bias uses Odds Ratios across stereotype-linked word categories (Agentic, Communal, Masculine, Feminine, Warmth, Competence, etc.) plus the Word Embedding Association Test (WEAT) to quantify associative skew. Second, language style bias measures formality (via classifier on the GYAFC corpus) and emotion distribution (via GoEmotions' 27 fine-grained categories) across demographic segments. Third, the Persuasion Bias Index (PBI) is a composite score: PBI = A + M + I, where A = (high_agency_verbs - low_agency_verbs) / (high + low) captures agency framing, M = (certainty_markers - hedging_markers) / (certainty + hedging) captures modal certainty, and I = lambda * imperative_verb_count captures directive force. A significant PBI gap between demographic groups signals persuasive asymmetry.

The paper found that across GPT-4o, Llama-3.3, and Mistral-Large 2.1, male-targeted messages had PBI scores 0.12-0.23 points higher than female-targeted messages (p < 0.05), agentic word Odds Ratios of 1.4x-4.2x favoring male targets, and formality scores significantly higher for female targets. These patterns held across all three models, indicating systemic rather than model-specific bias.

Step-by-Step Workflow

Define the demographic axes and message corpus. Identify which demographic variables the messaging system conditions on (gender, age group, region, etc.) and collect or generate a corpus of messages for each demographic segment. Minimum recommended: 40+ messages per segment for statistical power.
Construct matched prompt pairs using the SG/CRG framework. Create Standalone prompts that vary only the demographic attribute (e.g., "Write a message about clean energy for a [male/female] [young adult/senior] audience") and Context-Rich prompts that add topic and regional context. Keep all other variables identical across pairs.
Generate messages across demographic segments. Run each prompt variant through the LLM, collecting outputs for every combination of demographic attributes. Store outputs with full metadata (demographic condition, prompt type, model used).
Compute lexical content Odds Ratios. For each stereotype-linked word category (Agentic, Communal, Masculine, Feminine for gender; Warmth, Competence, Independence, Dependence, Progressive, Traditional for age), count category-word occurrences per demographic segment and compute Odds Ratios. An OR significantly above 1.0 indicates overrepresentation in one segment vs. another.
Run WEAT on salient terms. Extract the top-N most distinctive nouns and adjectives per demographic segment (via TF-IDF or frequency difference). Apply WEAT with attribute pairs like career/family or power/support to measure associative bias in the vocabulary chosen for each group.
Measure language style divergence. Score each message for formality (using a fine-tuned classifier or proxy heuristics: sentence length, passive voice ratio, vocabulary sophistication) and emotion profile (mapping to emotion categories). Apply Welch's t-test (two groups) or one-way ANOVA (multiple groups) to test for significant style differences.
Calculate the Persuasion Bias Index for each message. Count high-agency verbs (e.g., lead, build, drive, achieve, transform) vs. low-agency verbs (e.g., feel, care, hope, trust, appreciate). Count certainty markers (must, will, clearly, undeniably) vs. hedging markers (might, perhaps, possibly, could). Count imperative verbs. Compute PBI = A + M + I per message, then compare means across demographic segments with statistical tests.
Compare SG vs. CRG to measure context amplification. For each bias metric, compute the gap between demographic groups in SG and in CRG separately. If CRG gaps are wider than SG gaps, context is amplifying stereotypical patterns -- flag this as a pipeline risk.
Generate an audit report. Produce a structured report with: per-dimension bias scores, statistical significance, specific stereotypical word lists found, PBI breakdowns, and SG-vs-CRG amplification ratios. Highlight the most actionable findings.
Recommend mitigations. Based on findings, suggest concrete fixes: prompt debiasing (removing demographic conditioning where unnecessary), output post-processing (rebalancing agency/warmth terms), or adding fairness constraints to the generation pipeline.

Concrete Examples

Example 1: Auditing a climate campaign email generator

User: "We have a tool that generates climate action emails personalized by age and gender. Can you audit it for demographic bias?"

Approach:

Request sample outputs: at least 40 emails per demographic segment (e.g., Male 18-24, Female 65+, etc.)
Build the lexical audit -- count agentic words (lead, innovate, disrupt, build, drive) vs. communal words (care, nurture, protect, support, together) per segment
Compute Odds Ratios for each category pair across gender and age
Calculate PBI for each email and run t-tests across segments

Output:

DEMOGRAPHIC BIAS AUDIT REPORT
==============================
Corpus: 320 climate campaign emails (8 segments x 40 emails)

LEXICAL CONTENT BIAS (Gender)
  Agentic words OR (Male/Female): 3.21 (p=0.002) -- male emails 3.2x more agentic
  Communal words OR (Female/Male): 2.87 (p=0.004) -- female emails 2.9x more communal
  Top male-distinctive terms: "innovate", "lead", "investment", "technology"
  Top female-distinctive terms: "family", "community", "nurture", "safe"

LEXICAL CONTENT BIAS (Age)
  Warmth OR (Senior/Young Adult): 4.15 (p<0.001) -- senior emails 4.2x warmer
  Opportunity OR (YA/Senior): 3.67 (p<0.001) -- youth emails 3.7x more opportunity-focused

PERSUASION BIAS INDEX
  Male mean PBI: 0.48 | Female mean PBI: 0.29 | Gap: 0.19 (t=-2.91, p=0.004)
  Young Adult mean PBI: 0.52 | Senior mean PBI: 0.21 | Gap: 0.31 (t=-3.44, p=0.001)
  Agency framing drives most of the gender gap; modal certainty drives the age gap.

RECOMMENDATION: Male and youth audiences receive significantly more
persuasive, assertive messaging. Consider removing gender/age from prompt
conditioning or adding a post-generation rebalancing step for agency terms.

Example 2: Comparing standalone vs. contextual prompt bias

User: "I'm designing prompts for a health messaging system. How do I test whether adding demographic context makes bias worse?"

Approach:

Create SG prompts: "Write a message encouraging regular health checkups for a [male/female] [age group] audience."
Create CRG prompts: "Write a message encouraging regular health checkups for a [male/female] [age group] audience in the [Midwest/Northeast], focusing on [preventive care/chronic disease management]."
Generate 40+ messages per condition per variant
Compute PBI and lexical ORs for both SG and CRG sets
Compare gaps

Output:

CONTEXT AMPLIFICATION ANALYSIS
================================
                     SG Gap      CRG Gap     Amplification
Gender PBI:          0.08        0.22        2.75x
Gender Agentic OR:   1.6         3.4         2.13x
Age Warmth OR:       2.1         5.3         2.52x
Age PBI:             0.14        0.35        2.50x

FINDING: Adding regional and thematic context amplifies all measured
bias dimensions by 2-3x. The CRG prompts produce significantly more
stereotypical differentiation than SG prompts across every metric.

MITIGATION OPTIONS:
1. Use demographic-neutral prompt templates (remove gender/age conditioning)
2. Add explicit debiasing instructions: "Avoid gendered stereotypes in
   tone and word choice. Use consistent persuasive intensity across all
   audience segments."
3. Post-generation PBI equalization: rescore outputs and regenerate
   any message where PBI deviates >0.15 from the corpus mean.

Example 3: Building a bias-aware generation pipeline

User: "Help me write a Python script that generates targeted messages and flags biased ones before they go out."

Approach:

Implement message generation with demographic parameters
Add PBI scoring as a post-generation filter
Flag messages that exceed bias thresholds

Output:

import re
from collections import Counter

# Word lists derived from the auditing framework
HIGH_AGENCY_VERBS = {"lead", "build", "drive", "achieve", "transform", "innovate",
                     "disrupt", "create", "launch", "command", "pioneer", "tackle"}
LOW_AGENCY_VERBS = {"feel", "care", "hope", "trust", "appreciate", "support",
                    "nurture", "comfort", "share", "understand", "believe", "wish"}
CERTAINTY_MARKERS = {"must", "will", "clearly", "undeniably", "certainly",
                     "undoubtedly", "always", "absolutely", "definitely"}
HEDGING_MARKERS = {"might", "perhaps", "possibly", "could", "may", "sometimes",
                   "arguably", "potentially", "likely", "probably"}

def compute_pbi(text: str, imperative_weight: float = 0.1) -> dict:
    """Compute Persuasion Bias Index for a single message."""
    words = re.findall(r'\b\w+\b', text.lower())
    word_set = Counter(words)

    high_a = sum(word_set[w] for w in HIGH_AGENCY_VERBS)
    low_a = sum(word_set[w] for w in LOW_AGENCY_VERBS)
    agency = (high_a - low_a) / max(high_a + low_a, 1)

    cert = sum(word_set[w] for w in CERTAINTY_MARKERS)
    hedge = sum(word_set[w] for w in HEDGING_MARKERS)
    modal = (cert - hedge) / max(cert + hedge, 1)

    # Heuristic: sentences starting with a verb are likely imperatives
    sentences = re.split(r'[.!?]+', text)
    imperative_count = sum(
        1 for s in sentences
        if s.strip() and s.strip().split()[0].lower() in HIGH_AGENCY_VERBS
    )
    imp_score = imperative_weight * imperative_count

    pbi = agency + modal + imp_score
    return {"agency": agency, "modal_certainty": modal,
            "imperatives": imp_score, "pbi": pbi}

def audit_message_batch(messages: list[dict], pbi_threshold: float = 0.15) -> dict:
    """Audit a batch of messages grouped by demographic segment.

    Each message dict: {"text": str, "gender": str, "age_group": str}
    Returns segments with mean PBI and flags significant gaps.
    """
    from itertools import groupby
    segments = {}
    for key_field in ["gender", "age_group"]:
        sorted_msgs = sorted(messages, key=lambda m: m[key_field])
        for group_val, group_msgs in groupby(sorted_msgs, key=lambda m: m[key_field]):
            group_list = list(group_msgs)
            pbis = [compute_pbi(m["text"])["pbi"] for m in group_list]
            segments[f"{key_field}={group_val}"] = {
                "count": len(pbis),
                "mean_pbi": sum(pbis) / len(pbis),
                "flagged": [m["text"][:80] for m, p in zip(group_list, pbis)
                            if abs(p) > pbi_threshold]
            }
    return segments

Best Practices

Do: Always generate matched pairs -- keep everything identical except the demographic variable under test. Uncontrolled variation makes bias unmeasurable.
Do: Test with both SG (minimal context) and CRG (full context) prompts. SG reveals intrinsic model bias; CRG reveals deployment-realistic bias. Both matter.
Do: Use statistical significance tests (Welch's t-test for two groups, ANOVA for multiple) rather than eyeballing differences. The paper found effects at p < 0.05 across models.
Do: Examine all three dimensions (lexical, style, persuasion) independently. A message can be lexically neutral but persuasively biased, or vice versa.
Avoid: Treating a single LLM's behavior as universal. The paper found consistent patterns across GPT-4o, Llama-3.3, and Mistral, but magnitudes differed by 2-4x between models.
Avoid: Assuming that adding more context to prompts reduces bias. The paper's central finding is that context amplifies stereotypical asymmetries rather than diluting them.

Error Handling

Small sample sizes: With fewer than 30 messages per segment, statistical tests lose power and results become unreliable. If the corpus is small, report effect sizes (Cohen's d) alongside p-values and flag low confidence.
Word list coverage gaps: The predefined stereotype word categories may miss domain-specific biased terms (e.g., in finance or healthcare). Supplement standard lists with domain-relevant vocabulary identified via TF-IDF analysis of the actual corpus.
Imperative detection failures: Heuristic imperative counting (sentence-initial verbs) produces false positives with gerund-led sentences. Use POS tagging (spaCy or NLTK) for more accurate imperative identification when precision matters.
Non-English text: The LIWC categories, word lists, and formality classifiers are English-centric. For other languages, substitute equivalent lexicons and style classifiers, or translate before auditing (noting that translation introduces its own biases).
Intersectional effects: The framework tests gender and age independently. Intersectional segments (e.g., young women vs. senior men) may show compounding biases not captured by single-axis analysis. Run intersectional audits when segment sizes allow.

Limitations

The framework uses binary gender categories (male/female) and does not cover non-binary or gender-minority groups. Extending to additional gender identities requires careful construction of appropriate stereotype word categories.
The Persuasion Bias Index is a proxy metric based on lexical signals, not a direct measure of persuasive impact on real audiences. High PBI does not guarantee actual persuasive effect.
The paper validated on climate communication only. Bias patterns may differ in domains like healthcare, finance, or political messaging -- the stereotype categories need domain adaptation.
Odds Ratios are sensitive to word list construction. Different choices of which words count as "agentic" vs. "communal" will shift results. Always document and justify word list decisions.
The method is descriptive, not prescriptive -- it identifies bias but does not automatically debias outputs. Mitigation requires separate intervention design (prompt engineering, post-processing, or fine-tuning).

Reference

Islam, T. (2026). "Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text." arXiv:2601.17172v1. https://arxiv.org/abs/2601.17172v1

Key sections to study: Section 3 (Evaluation Framework) for the three-dimensional bias measurement methodology, Table 4 for gender PBI results with statistical significance, and Section 5.2 for the context amplification analysis comparing SG vs. CRG bias gaps.