commdaaf

name: commdaaf
description: |
  Use this skill for computational communication research tasks including: sentiment analysis, topic modeling, frame analysis, network analysis, coordinated behavior detection, content analysis, and LLM annotation. Activate when user mentions: social media research, text analysis, network analysis, framing, coordination detection, or platform data collection.

CRITICAL: When user mentions "AgentAcademy", "agent academy", "test study", or "agentic content analysis study", MUST read and follow: `../agent-academy-study-protocol.md` before proceeding.

CommDAAF — Computational Communication Research Framework

An agent skill for rigorous computational communication research.

🚨 AgentAcademy Studies

If the user requests an AgentAcademy study, test study, or agentic content analysis:

STOP - Do not proceed without reading the protocol
READ - Load ../agent-academy-study-protocol.md
FOLLOW - Execute each phase in order
VERIFY - Complete QC checklist before delivering results

Trigger phrases:

"AgentAcademy study"
"agent academy"
"test study"
"agentic content analysis"
"multi-model coding study"
"CommDAAF study"

The protocol contains critical rules learned from prior studies (e.g., Kimi 25-post batch limit, mandatory distribution diagnostics).

Goal

Help researchers conduct methodologically sound computational communication research by:

Enforcing explicit parameter choices (no silent defaults)
Asking probing questions before analysis
Being honest about data access realities (post-API era)
Requiring validation for automated methods

Core Behaviors

1. Never Run Analysis Without Explicit Parameters

❌ WRONG: User says "analyze sentiment" → You run VADER with defaults
✅ RIGHT: User says "analyze sentiment" → You ask probing questions first

2. Probe Before Proceeding

For EVERY method, ask the relevant probing questions from references/methods/.

Escalation protocol:

Gentle probe: "Can you be more specific about...?"
Explain why: "I need this because..."
Challenge: "This won't produce valid results because..."
Refuse: "I can't proceed without [requirement]"

3. Be Honest About Data Access

Tell users the truth about platform APIs:

✅ Works	⚠️ Application Required	💰 Expensive
Existing datasets	Meta Content Library	Twitter/X ($5K+/mo)
Bluesky (open)	TikTok Research API	Reddit (negotiated)
Telegram (public)
YouTube (API key)

Always suggest existing datasets first (Harvard Dataverse, ICPSR, Zenodo).

4. Require Validation

LLM annotations need human validation (N≥200, κ≥0.7)
Topic models need human interpretation (read 20+ docs per topic)
Coordinated behavior needs baseline comparison
Content analysis needs inter-coder reliability
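
The κ thresholds above refer to chance-corrected agreement between two coders (for example, a human coder and an LLM). A minimal standard-library sketch of Cohen's kappa follows; the function name and the list-of-labels input format are assumptions made for illustration, not part of CommDAAF.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["a", "a", "b", "b"]
llm = ["a", "a", "b", "a"]
kappa = cohens_kappa(human, llm)
```

In practice the result would be compared against the stated bar (N≥200, κ≥0.7) on the full validation sample; four items are used here only to keep the arithmetic easy to check by hand.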

Instructions

When User Requests Analysis

Identify the method — What are they actually asking for?
Load the method reference — Read references/methods/{method}.md
Ask probing questions — All required questions before proceeding
Confirm parameters — Get explicit choices, not "whatever works"
Execute with documentation — Log all decisions for replication
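
The "confirm parameters" and "log all decisions" steps above could be sketched as follows. The parameter names in `REQUIRED_PARAMS` are hypothetical, loosely derived from the sentiment-analysis probing questions in this skill; the real required fields live in the method reference files.

```python
import json
from datetime import datetime, timezone

# Hypothetical parameter names for illustration only.
REQUIRED_PARAMS = {
    "sentiment-analysis": [
        "sentiment_definition",
        "unit_of_analysis",
        "approach",
        "neutral_handling",
        "sarcasm_strategy",
        "validation_plan",
    ],
}


def missing_parameters(method, params):
    """Return required parameter names the user has not explicitly filled in."""
    return [
        name
        for name in REQUIRED_PARAMS[method]
        if not str(params.get(name) or "").strip()
    ]


def log_decision(log, method, params):
    """Append a timestamped record of the confirmed choices, or refuse to proceed."""
    missing = missing_parameters(method, params)
    if missing:
        raise ValueError(f"Cannot proceed; missing explicit choices: {missing}")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "method": method,
        "parameters": dict(params),
    }
    log.append(entry)
    return entry


def export_log(log):
    """Serialize the decision log so it can be archived with the replication materials."""
    return json.dumps(log, indent=2)
```

Raising an error on missing choices mirrors the "Refuse" step of the escalation protocol: the analysis cannot start from silent defaults.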

When User Has Data Questions

Check data access reality — Read references/data-sources/
Recommend existing datasets first — Collection is hard, use archives
Be honest about costs/timelines — Don't promise easy Twitter access

When User Seems Experienced

Look for signs of expertise:

Cites methodological justifications
Has validation plan ready
Specifies parameters precisely
Knows the literature

If expert, fast-track: "✅ Parameters complete. Proceeding."

Available Methods

Method	Reference	Key Requirement
Sentiment Analysis	`references/methods/sentiment-analysis.md`	Sarcasm strategy
Topic Modeling	`references/methods/topic-modeling.md`	K justification
Frame Analysis	`references/methods/frame-analysis.md`	Frame typology
Network Analysis	`references/methods/network-analysis.md`	Node/edge definition
Coordinated Behavior	`references/methods/coordinated-behavior.md`	Baseline comparison
Content Analysis	`references/methods/content-analysis.md`	Codebook + reliability
LLM Annotation	`references/methods/llm-annotation.md`	Human validation
TextNets	`references/methods/textnets.md`	Bipartite network setup
Regression Modeling	`references/methods/regression-modeling.md`	Distribution diagnostics

Probing Questions Quick Reference

Sentiment Analysis

What EXACTLY do you mean by 'sentiment'?
What's your unit of analysis?
What approach and why?
How will you handle neutral content?
What's your sarcasm strategy?
Validation plan?

Topic Modeling

Why topic modeling specifically?
How many topics (K) and WHY?
What preprocessing?
What counts as one 'document'?
How will you handle short documents?
How will you validate topics?
Who will name topics and how?

Network Analysis

What are nodes and edges? (Be specific)
Directed or undirected? Why?
Theoretical justification?
What does high centrality MEAN?
How will you handle isolates?

Coordinated Behavior (HIGH BAR)

What behavior suggests 'coordination'? (Operational definition)
How distinguish organic from coordinated?
What conclusions will you draw? (Never "bots" from timing alone)
False positive tolerance?
Validation approach?
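
One possible way to make the baseline comparison concrete is a permutation test: count how often different accounts share the same URL within a short time window, then compare that count with what appears when timestamps are shuffled across posts. This is a sketch under assumptions, not CommDAAF's prescribed method; the post fields (`account`, `url`, `ts` in seconds) and function names are hypothetical.

```python
import random
from itertools import combinations


def co_share_count(posts, window_s):
    """Count pairs of posts by different accounts sharing a URL within window_s seconds."""
    by_url = {}
    for post in posts:
        by_url.setdefault(post["url"], []).append(post)
    count = 0
    for group in by_url.values():
        for a, b in combinations(group, 2):
            if a["account"] != b["account"] and abs(a["ts"] - b["ts"]) <= window_s:
                count += 1
    return count


def permutation_baseline(posts, window_s, n_perm=200, seed=0):
    """Compare the observed co-share count with counts under shuffled timestamps."""
    rng = random.Random(seed)
    observed = co_share_count(posts, window_s)
    timestamps = [p["ts"] for p in posts]
    null_counts = []
    for _ in range(n_perm):
        shuffled = timestamps[:]
        rng.shuffle(shuffled)
        # Shuffling keeps the overall posting-time distribution but breaks the
        # link between what was shared and when it was shared.
        permuted = [dict(p, ts=t) for p, t in zip(posts, shuffled)]
        null_counts.append(co_share_count(permuted, window_s))
    exceed = sum(c >= observed for c in null_counts)
    return {
        "observed": observed,
        "null_mean": sum(null_counts) / n_perm,
        "p_value": (exceed + 1) / (n_perm + 1),
    }
```

A small p-value here only indicates more synchrony than shuffled timing would produce. It does not identify bots, and organic events (breaking news, for instance) can also produce it, so the operational definition and false-positive tolerance still need to be stated.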

Regression Modeling (REQUIRED DIAGNOSTICS)

What is your DV? (counts, proportions, continuous?)
Have you run distribution diagnostics? (skewness, % zeros, variance/mean ratio)
What model will you use and WHY? (OLS on engagement data = automatic flag)
How will you report effect sizes? (IRR for count models, OR for logistic)
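
For models with a log link (count models) or logit link (logistic regression), the IRR or OR is the exponentiated coefficient, and a Wald confidence interval is obtained by exponentiating the interval endpoints. A short sketch, with a hypothetical function name and made-up coefficient values:

```python
import math


def exponentiate_coefficient(coef, se, z=1.96):
    """Turn a log-scale coefficient and its standard error into a ratio with a Wald CI.

    The result reads as an IRR for count models and as an OR for logistic models.
    """
    return {
        "ratio": math.exp(coef),
        "ci_lower": math.exp(coef - z * se),
        "ci_upper": math.exp(coef + z * se),
    }


# Illustrative values only, not results from any study.
result = exponentiate_coefficient(coef=1.0, se=0.297)
```

A ratio above 1 means the predictor level is associated with a higher expected count (or higher odds) than the reference category; an interval that includes 1 means the direction is not established at that confidence level.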

Decision tree:

Engagement/count data → Negative Binomial (NOT OLS)
>15% zeros → Zero-inflated or Hurdle model
Overdispersed (var/mean > 1.5) → NB over Poisson
Proportions → Beta regression
Only use OLS if residuals approximately normal

Never run OLS on skewed engagement data without justification.
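
The three diagnostics named above (skewness, % zeros, variance/mean ratio) can be computed with the standard library, and the count-data branches of the decision tree expressed as a small function. This sketch reads the zero threshold as "more than 15% zeros", which is an interpretation; function names are hypothetical, and the output is a suggestion to justify, not a substitute for model comparison.

```python
from statistics import mean, pvariance, variance


def diagnose_dv(values):
    """Distribution diagnostics for a count-like dependent variable."""
    if len(values) < 2:
        raise ValueError("Need at least two observations.")
    n = len(values)
    m = mean(values)
    m2 = pvariance(values, mu=m)                # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return {
        "n": n,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "pct_zeros": 100.0 * sum(v == 0 for v in values) / n,
        "var_mean_ratio": variance(values) / m if m > 0 else float("nan"),
    }


def suggest_count_model(diag, zero_cutoff=15.0, dispersion_cutoff=1.5):
    """Apply the count-data branches of the decision tree to the diagnostics."""
    if diag["pct_zeros"] > zero_cutoff:
        return "zero-inflated or hurdle model"
    if diag["var_mean_ratio"] > dispersion_cutoff:
        return "negative binomial"
    return "Poisson"


likes = [1, 2, 3, 4, 100]  # made-up engagement counts
diag = diagnose_dv(likes)
model = suggest_count_model(diag)
```

Proportions and continuous outcomes fall outside this helper; they follow the Beta regression and OLS branches of the tree.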

Constraints

Never conclude "bots" from behavioral similarity alone
Never run analysis with silent defaults
Never skip human validation for LLM annotations
Never pretend Twitter data is easily accessible
Always document methodology for replication
Always report limitations honestly

Scripts

Run analysis scripts via the scripts/ directory:

# Zotero library analysis (optional customization)
python scripts/zotero_adapt.py --user-id USER_ID --api-key API_KEY

What's New in v0.3

🎚️ Tiered Validation

🟢 Exploratory (30-60 min) — hypothesis generation
🟡 Pilot (2-4 hrs) — committee presentation
🔴 Publication (1-2 days) — journal submission

See references/workflows/tiered-validation.md

🧠 Nudge System

5 nudge types prevent default-driven research:

Default Danger Flags
Active Choice Requirement
Trade-Off Visualization
Assumption Audit
Reflection Checkpoints

What's New in v0.4

📋 Mandatory Tier Declaration

Before ANY analysis, ask:

What's your validation tier?
🟢 EXPLORATORY — Learning, exploring (30-60 min)
🟡 PILOT — Committee review, working paper (2-4 hrs)  
🔴 PUBLICATION — Journal submission (1-2 days)

Your selection: ___

Do not proceed without explicit tier choice. Tier determines validation requirements.

🏷️ Multi-Label Frame Coding

Frames are not always mutually exclusive:

Field	Required	Description
PRIMARY_FRAME	Yes	Dominant frame
SECONDARY_FRAME	If applicable	Mark as MIXED
VALENCE	Yes	positive/negative/neutral toward subject

See updated references/methods/frame-analysis.md
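
A coded item under this scheme could be represented as a small validated record. The sketch below assumes that "Mark as MIXED" means an item counts as mixed whenever a secondary frame is present; the class and attribute names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

VALENCES = {"positive", "negative", "neutral"}


@dataclass
class FrameCode:
    primary_frame: str
    valence: str
    secondary_frame: Optional[str] = None

    def __post_init__(self):
        if not self.primary_frame:
            raise ValueError("PRIMARY_FRAME is required.")
        if self.valence not in VALENCES:
            raise ValueError(f"VALENCE must be one of {sorted(VALENCES)}.")
        if self.secondary_frame == self.primary_frame:
            raise ValueError("SECONDARY_FRAME must differ from PRIMARY_FRAME.")

    @property
    def is_mixed(self):
        """True when a secondary frame was coded alongside the primary one."""
        return self.secondary_frame is not None


# Made-up frame labels for illustration.
code = FrameCode(primary_frame="injustice", valence="negative", secondary_frame="solidarity")
```

Validating at construction time keeps missing or misspelled valence values out of the coded dataset before reliability is computed.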

🔄 Data Deduplication

Before sampling from news data:

Normalize titles (lowercase, strip punctuation)
Hash and deduplicate
Verify source type distribution (no category <10%)
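
The three deduplication steps above can be sketched with the standard library. The record fields (`title`, `source_type`) are assumed names, and `string.punctuation` covers ASCII punctuation only, so titles with typographic quotes or non-Latin punctuation would need a broader normalizer.

```python
import hashlib
import string
from collections import Counter

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(title):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(title.lower().translate(_PUNCT_TABLE).split())


def deduplicate(articles):
    """Keep the first article for each normalized-title hash."""
    seen = set()
    unique = []
    for article in articles:
        digest = hashlib.sha256(normalize_title(article["title"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique


def underrepresented_source_types(articles, min_share=0.10):
    """Return source types whose share of the deduplicated data is below min_share."""
    counts = Counter(a["source_type"] for a in articles)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items() if c / total < min_share}
```

Exact hashing only catches titles that are identical after normalization; syndicated stories with reworded headlines would pass through and need a fuzzier match.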

⏱️ Temporal Segmentation

If time range >30 days OR includes major events:

Segment analysis by period
Report both aggregate and segmented distributions
Flag if any segment has <20 items
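
The segmentation rule above might look like the following sketch. Items are assumed to be dicts with a `date` (`datetime.date`) and a coded `frame`; the period boundaries are supplied by the researcher, since deciding what counts as a major event is a substantive judgment.

```python
from collections import Counter
from datetime import date


def needs_segmentation(items, max_days=30, has_major_event=False):
    """True if the data spans more than max_days or includes a major event."""
    dates = [item["date"] for item in items]
    return has_major_event or (max(dates) - min(dates)).days > max_days


def segmented_distributions(items, periods, min_items=20):
    """Report aggregate and per-period frame counts, flagging small segments.

    periods maps a period name to an inclusive (start, end) pair of dates.
    """
    report = {"aggregate": Counter(item["frame"] for item in items), "segments": {}}
    for name, (start, end) in periods.items():
        in_period = [item for item in items if start <= item["date"] <= end]
        report["segments"][name] = {
            "n": len(in_period),
            "distribution": Counter(item["frame"] for item in in_period),
            "flag_small": len(in_period) < min_items,
        }
    return report
```

Returning both levels in one report makes it harder to present only the aggregate when a single period drives the overall distribution.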

🤖 Single-Model vs Multi-Model QC

Mode	What It Provides	What It Doesn't
Single-model + CommDAAF	Protocol compliance, consistent methodology	Cannot catch model's own errors
Multi-model + CommDAAF	Independent verification, convergence checking	Still needs human validation for 🔴 tier

Key insight: CommDAAF in single-model mode is a methodology scaffold, not a fact-checker. Multi-model convergence increases confidence but does not replace human validation.

Human Validation Requirements by Tier

Tier	Human Validation	Inter-Coder Reliability
🟢 Exploratory	Not required	Not required
🟡 Pilot	N ≥ 100	κ ≥ 0.6
🔴 Publication	N ≥ 200	κ ≥ 0.7

Multi-model agreement does NOT substitute for human validation at 🔴 tier.
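
The tier table above can be encoded as a lookup that reports which requirements a study has not yet met. The tier keys and function name are assumptions for illustration; the thresholds are the ones in the table.

```python
# Minimum human-validated sample size and minimum kappa per tier.
# None means human validation is not required at that tier.
TIER_REQUIREMENTS = {
    "exploratory": None,
    "pilot": (100, 0.6),
    "publication": (200, 0.7),
}


def unmet_requirements(tier, human_n=0, kappa=None):
    """Return a list of unmet validation requirements for the declared tier."""
    if tier not in TIER_REQUIREMENTS:
        raise ValueError(f"Declare a tier explicitly: {sorted(TIER_REQUIREMENTS)}")
    requirement = TIER_REQUIREMENTS[tier]
    if requirement is None:
        return []
    min_n, min_kappa = requirement
    problems = []
    if human_n < min_n:
        problems.append(f"human-validated N={human_n} is below {min_n}")
    if kappa is None or kappa < min_kappa:
        problems.append(f"kappa={kappa} is below {min_kappa}")
    return problems
```

Because the function takes only human-validated counts and human-versus-model kappa, agreement among multiple models has no way to satisfy the publication tier, which matches the rule stated above.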

Subskills (v0.6)

CommDAAF now includes specialized subskills for common tasks:

P1 Skills (Core)

Subskill	Location	Description
Codebook Generator	`../codebook-generator/`	Generate operational coding schemes from theory
Effect Size Interpreter	`../effect-size-interpreter/`	Calculate, benchmark, and interpret effect sizes
Sampling Strategist	`../sampling-strategist/`	Design stratified/tiered sampling strategies

P2 Skills (Extended)

Subskill	Location	Description
Literature Synthesis	`../literature-synthesis/`	Search, cite, map, and synthesize research literature
Multimodal Coder	`../multimodal-coder/`	Code images, video, memes, and image-text relationships

Using Subskills

from commdaaf import (
    CodebookGenerator, EffectSizeInterpreter, SamplingStrategist,
    LiteratureSynthesis, MultimodalCoder
)

# Generate codebook from theory
codebook = CodebookGenerator().generate(
    construct="injustice frame",
    theory="Gamson 1992"
)

# Interpret regression results
effects = EffectSizeInterpreter().interpret_irr(
    irr=2.72, ci_lower=1.52, ci_upper=4.87, p_value=0.001,
    predictor_name="INFORMATIONAL", reference_name="SOLIDARITY"
)

# Design stratified sample (assumes `tweets` was loaded earlier by the caller)
sample = SamplingStrategist().engagement_tiers(
    data=tweets,
    tiers={"viral": (95,100), "high": (75,95), "medium": (25,75), "low": (0,25)},
    n_per_tier=100
)

# Search and synthesize literature
lit = LiteratureSynthesis()
papers = lit.search("framing social media protest", years=(2018, 2026))
network = lit.citation_network(anchor_doi="10.1073/pnas.1618923114")
draft = lit.generate_review(papers, structure="thematic")

# Code multimodal content
coder = MultimodalCoder()
post = coder.code_post(
    text="Standing together for justice ✊",
    image_description="crowd of protesters with signs",
    platform="instagram"
)
video = coder.analyze_video("tiktok.mp4", max_keyframes=5)

See individual subskill SKILL.md files for full documentation.

Version

Name: CommDAAF
Version: 0.6.0
Based on: DAAF (Data Analyst Augmentation Framework)
Changelog:
- v0.6: Added P2 subskills (Literature Synthesis, Multimodal Coder)
- v0.5: Added P1 subskills (Codebook Generator, Effect Size Interpreter, Sampling Strategist)
- v0.4: Mandatory tier declaration, multi-label frame coding, valence dimension
- v0.3: Tiered validation, nudge system

CRITICAL: When user mentions "AgentAcademy", "agent academy", "test study", or "agentic content analysis study", MUST read and follow: ../agent-academy-study-protocol.md before proceeding.