autonomous-multi-agent-ai-high-throughput - SKILL.md Agent Skill

name: "autonomous-multi-agent-ai-high-throughput"
description: |
  Build multi-agent AI systems for high-throughput scientific workflows with metacognitive self-assessment. Implements the Polymer Research Lifecycle (PRL) architecture: a Planner Agent decomposes complex scientific tasks into subtasks assigned to specialized domain agents (Research, Characterization, ML Model, Safety, Synthesis, Execution, Reporting), which produce consensus predictions with uncertainty estimates and continuously self-optimize via three-layer metacognitive reflection.
  Trigger phrases:
  - "Build a multi-agent pipeline for materials property prediction"
  - "Create a high-throughput screening system with agent consensus"
  - "Implement metacognitive self-assessment for an agent swarm"
  - "Design an autonomous scientific workflow with specialized agents"
  - "Set up a polymer informatics pipeline with uncertainty quantification"
  - "Orchestrate domain-specific agents for computational chemistry"

Autonomous Multi-Agent AI for High-Throughput Scientific Workflows

This skill enables Claude to architect and implement multi-agent systems that follow the Polymer Research Lifecycle (PRL) pipeline pattern from Roy et al. (2026). The core idea: a central Planner Agent decomposes complex scientific or engineering tasks into subtasks dispatched to specialized domain agents, each with distinct tools and models. Agents return independent predictions that are aggregated via consensus for uncertainty quantification, while a three-layer metacognitive framework (tactical, strategic, meta-strategic) monitors agent effectiveness and dynamically adjusts execution strategies. This pattern generalizes beyond polymers to any domain requiring high-throughput prediction, multi-model consensus, and self-improving agent orchestration.

When to Use

When the user wants to build a multi-agent system where different agents have distinct domain expertise (e.g., data retrieval, ML inference, simulation, safety checking)
When implementing high-throughput screening pipelines that must process thousands of inputs with property predictions and uncertainty estimates
When designing agent orchestration with a planner/dispatcher pattern that decomposes complex tasks into specialist subtasks
When the user needs consensus-based uncertainty quantification by running multiple independent prediction models and aggregating results with confidence intervals
When building self-improving agent systems that track their own performance, identify weak agents, and adjust strategies over time
When creating scientific workflows that chain retrieval, feature engineering, prediction, generation, and safety filtering into an end-to-end pipeline

Key Technique

Hierarchical Agent Orchestration with Consensus. The PRL architecture uses a four-layer pipeline: (1) Data Ingestion, where a centralized repository integrates heterogeneous sources; (2) Preprocessing and Feature Engineering, where domain-specific tokenizers and encoders produce embeddings; (3) Agent Processing, where specialized agents (Research, Characterization, ML Model, Safety, Synthesis, Execution, Reporting) perform domain computations; and (4) Output Integration, where results are aggregated, visualized, and reported. The Planner Agent sits above this pipeline: it decomposes user requests into subtasks, assigns each to the appropriate specialist, and merges the outputs. For prediction tasks, multiple independent agents (e.g., a GNN agent, a descriptor-based predictor, and a simulation agent) produce separate estimates. The consensus prediction is the mean of those estimates, and the uncertainty is their standard deviation. This gives an inexpensive, heuristic measure of inter-model agreement without Bayesian methods; it is not a calibrated confidence interval (see Limitations).
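
In symbols, the consensus rule above can be written as follows. The notation is ours rather than the paper's: $K$ agents return scalar predictions $y_1, \dots, y_K$ for the same input, and the population form of the standard deviation is assumed (the default of `np.std`).

$$
\hat{y} = \frac{1}{K}\sum_{k=1}^{K} y_k,
\qquad
\sigma = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(y_k - \hat{y}\right)^2}
$$

For $K = 2$ this reduces to $\sigma = |y_1 - y_2| / 2$, so with two agents the uncertainty carries no more information than the gap between the two predictions.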

Metacognitive Self-Assessment. The system implements three reflection layers that run after each task cycle. Tactical reflection evaluates individual agent operations (Did the Research Agent find relevant papers? Did the ML Agent's predictions fall within expected error bounds?). Strategic reflection evaluates overall progress toward the research objective and pipeline efficiency. Meta-strategic reflection tracks learning patterns across multiple cycles, identifying persistent weaknesses and triggering corrective actions such as curriculum-like retraining objectives for underperforming agents. Each agent receives an effectiveness score; agents scoring below population average are flagged for adjustment. In the paper's polystyrene case study, this mechanism detected that the Synthesis Agent (score 0.30) and Research Agent (score 0.57) were underperforming and dynamically generated improvement objectives.

Linear Scalability via Parallel Dispatch. Because the work decomposes into independent subtasks, total runtime grows linearly, O(n), with the number of items, up to 10,000+ items. Parallel agent execution yields a ~5x speedup on multi-core systems. This makes the pattern suitable for high-throughput screening where thousands of candidates must be evaluated cheaply.

Step-by-Step Workflow

Define the agent roster and their tool access. Create a configuration specifying each specialist agent's name, role description, available tools (e.g., RDKit for molecular descriptors, a GNN model for graph-based prediction, an API for literature retrieval), and the LLM backing each agent. Use a JSON or YAML manifest:

{
  "agents": [
    {"name": "research", "role": "Retrieve and summarize scientific literature", "tools": ["arxiv_search", "semantic_scholar"]},
    {"name": "ml_model", "role": "Run property predictions via trained models", "tools": ["polygnn", "property_predictor"]},
    {"name": "safety", "role": "Screen candidates against safety and feasibility criteria", "tools": ["safety_db", "toxicity_checker"]},
    {"name": "execution", "role": "Orchestrate task flow, handle errors, verify consistency", "tools": ["task_queue", "logger"]}
  ]
}
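
A manifest like this is only useful if it is checked before agents are instantiated. The sketch below is one possible loader, not part of the PRL paper: the `AgentSpec` class and `load_roster` function are hypothetical names, and the two validation rules (unique names, at least one tool) are assumptions about what a reasonable roster requires.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    name: str
    role: str
    tools: tuple[str, ...]

def load_roster(manifest_text: str) -> dict[str, AgentSpec]:
    """Parse the manifest and index agents by name, rejecting obvious mistakes."""
    roster: dict[str, AgentSpec] = {}
    for entry in json.loads(manifest_text)["agents"]:
        spec = AgentSpec(entry["name"], entry["role"], tuple(entry["tools"]))
        if spec.name in roster:
            raise ValueError(f"duplicate agent name: {spec.name}")
        if not spec.tools:
            raise ValueError(f"agent {spec.name} has no tools")
        roster[spec.name] = spec
    return roster

# Two entries copied from the manifest above.
MANIFEST = """
{"agents": [
  {"name": "research", "role": "Retrieve and summarize scientific literature",
   "tools": ["arxiv_search", "semantic_scholar"]},
  {"name": "safety", "role": "Screen candidates against safety and feasibility criteria",
   "tools": ["safety_db", "toxicity_checker"]}
]}
"""
roster = load_roster(MANIFEST)
print(sorted(roster))
```

Indexing by name lets the Planner resolve a subtask's `agent_name` with a single dictionary lookup and fail early on a typo.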

Implement the Planner Agent as a task decomposer. The Planner receives the user's high-level request, breaks it into atomic subtasks with explicit inputs/outputs, assigns each to the appropriate specialist agent, and defines the dependency graph. Use structured output (JSON) so downstream agents receive typed inputs:
```
def plan_task(request: str) -> list[Subtask]:
    # LLM call to decompose request into subtasks
    # Each subtask has: id, agent_name, input_schema, output_schema, depends_on
    ...
```
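
Once the Planner has produced subtasks, the `depends_on` edges determine what can run in parallel. The following is a minimal sketch of that scheduling step, with the LLM decomposition itself left out. The `Subtask` fields shown are a subset of those listed above, `execution_waves` is a hypothetical helper, and the example plan is invented for illustration. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Subtask:
    id: str
    agent_name: str
    depends_on: list[str] = field(default_factory=list)

def execution_waves(subtasks: list[Subtask]) -> list[list[str]]:
    """Group subtask ids into waves; every id in a wave can run in parallel."""
    sorter = TopologicalSorter({t.id: set(t.depends_on) for t in subtasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Illustrative plan; the ids and dependencies are made up for this sketch.
plan = [
    Subtask("retrieve", "research"),
    Subtask("featurize", "execution"),
    Subtask("predict", "ml_model", depends_on=["featurize"]),
    Subtask("screen", "safety", depends_on=["predict"]),
    Subtask("report", "reporting", depends_on=["retrieve", "screen"]),
]
print(execution_waves(plan))
# [['featurize', 'retrieve'], ['predict'], ['screen'], ['report']]
```

Each inner list is a set of subtasks with no unmet dependencies, so an orchestrator can dispatch a whole wave concurrently before moving to the next.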
Build the four-layer data pipeline. Layer 1: Ingest raw data (SMILES strings, experimental measurements, literature references) into a unified store. Layer 2: Preprocess into model-ready features (molecular graphs, fingerprint vectors, normalized descriptors). Layer 3: Dispatch to specialist agents. Layer 4: Aggregate and format outputs.
Implement consensus prediction with uncertainty. For any prediction target, run at least two independent agents (different model architectures or data representations). Compute the consensus as the mean of agent predictions and uncertainty as the standard deviation:
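```
import numpy as np

# Each agent returns an independent scalar estimate for the same candidate.
predictions = [agent.predict(candidate) for agent in prediction_agents]
consensus = float(np.mean(predictions))
uncertainty = float(np.std(predictions))  # population std (ddof=0) across agents
result = {"value": consensus, "uncertainty": uncertainty, "agent_predictions": predictions}
```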
Add safety and feasibility filtering. Before any candidate reaches the output stage, route it through the Safety Agent, which checks against domain constraints (physical plausibility, toxicity thresholds, regulatory compliance, cost bounds). Reject or flag candidates that fail.
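
The simplest part of this filter, a physical-plausibility check, might look like the sketch below. The `screen_candidate` function, the record keys, and the numeric limits are all illustrative assumptions; real toxicity or regulatory checks would call out to domain databases and are not shown.

```python
def screen_candidate(record: dict, limits: dict[str, tuple[float, float]]) -> dict:
    """Return the record with a `flagged` bool and the reasons for any flag."""
    reasons = []
    for prop, (low, high) in limits.items():
        value = record.get(prop)
        if value is None:
            reasons.append(f"{prop}: missing")
        elif not (low <= value <= high):  # also catches NaN, which fails every comparison
            reasons.append(f"{prop}: {value} outside [{low}, {high}]")
    return {**record, "flagged": bool(reasons), "reasons": reasons}

# Illustrative bounds only; choose limits appropriate to your domain.
LIMITS = {"tg_K": (100.0, 800.0), "density_gcc": (0.5, 3.0)}

ok = screen_candidate({"smiles": "CC(c1ccccc1)", "tg_K": 378.2, "density_gcc": 1.021}, LIMITS)
bad = screen_candidate({"smiles": "CC", "tg_K": 378.2, "density_gcc": -0.4}, LIMITS)
print(ok["flagged"], bad["flagged"], bad["reasons"])
```

Returning the reasons alongside the flag keeps rejected candidates auditable instead of silently dropping them.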
Implement the three-layer metacognitive loop. After each task cycle, run reflection:
- Tactical: Score each agent's output quality (e.g., prediction error, retrieval relevance, task completion rate). Store as {"agent": "research", "effectiveness": 0.57}.
- Strategic: Compute pipeline-level metrics: efficiency = (accuracy * success_rate) / normalized_time. Compare against target thresholds.
- Meta-strategic: Across multiple cycles, detect trends. If an agent's effectiveness is consistently below the population average, generate a corrective action (swap model, augment training data, adjust prompt).
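
The strategic metric and the meta-strategic trend check can be sketched in a few lines. This is one possible reading of the rules above, not the paper's implementation: `agents_needing_action` and the score history are hypothetical, the three-cycle window follows the "3+ consecutive cycles" guidance given later under Error Handling, and all agents are assumed to have been scored in the same cycles.

```python
from statistics import mean

def pipeline_efficiency(accuracy: float, success_rate: float, normalized_time: float) -> float:
    """Strategic metric from the text: (accuracy * success_rate) / normalized_time."""
    return (accuracy * success_rate) / normalized_time

def agents_needing_action(history: dict[str, list[float]], window: int = 3) -> list[str]:
    """Return agents below the population average in each of the last `window` cycles."""
    n_cycles = min(len(scores) for scores in history.values())
    if n_cycles < window:
        return []  # not enough evidence for a trend yet
    flagged = []
    for agent, scores in history.items():
        below = []
        for cycle in range(n_cycles - window, n_cycles):
            population_avg = mean(s[cycle] for s in history.values())
            below.append(scores[cycle] < population_avg)
        if all(below):
            flagged.append(agent)
    return flagged

# Made-up effectiveness scores, one entry per cycle.
history = {
    "research": [0.70, 0.45, 0.42, 0.40],
    "ml_model": [0.85, 0.88, 0.87, 0.86],
    "safety": [0.60, 0.91, 0.90, 0.92],
}
print(agents_needing_action(history))  # ['research']
print(round(pipeline_efficiency(0.9, 0.8, 1.2), 2))  # 0.6
```

Requiring every cycle in the window to fall below average is deliberately conservative: a single bad cycle caused by a hard input does not trigger a corrective action.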
Enable parallel dispatch for throughput. For independent subtasks (e.g., predicting properties for a batch of 1,000 molecules), dispatch them in parallel across agent instances. Use async execution or a task queue:
```
import asyncio

async def screen_batch(items: list[str]) -> list[Result]:
    # Schedule one consensus prediction per item and await them concurrently.
    tasks = [predict_with_consensus(item) for item in items]
    return await asyncio.gather(*tasks)
```
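
Launching thousands of coroutines at once can exhaust API rate limits or memory, so in practice the dispatch is usually capped. Below is a minimal sketch using `asyncio.Semaphore`. The `predict_with_consensus` body is a placeholder that returns dummy values, and `screen_batch_bounded` with its `max_concurrent` parameter is a hypothetical variant of `screen_batch`.

```python
import asyncio

async def predict_with_consensus(item: str) -> dict:
    """Placeholder for the real consensus call; returns a dummy record."""
    await asyncio.sleep(0)  # stands in for model inference or an API call
    return {"input": item, "value": float(len(item))}

async def screen_batch_bounded(items: list[str], max_concurrent: int = 50) -> list[dict]:
    """Run predictions concurrently, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: str) -> dict:
        async with semaphore:
            return await predict_with_consensus(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))

results = asyncio.run(screen_batch_bounded(["CC", "CCO", "c1ccccc1"], max_concurrent=2))
print([r["value"] for r in results])  # [2.0, 3.0, 8.0]
```

Note that `asyncio` only overlaps waiting time, which suits I/O-bound agents such as API calls. CPU-bound model inference needs a process pool or a task queue to use multiple cores.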
Wire up the generative design loop (if applicable). For design tasks, chain: user requirements -> candidate generation (LLM-based) -> property prediction (ML agents) -> safety screening -> scoring (novelty, feasibility, creativity) -> ranked output. Implement as a closed loop where top candidates can be fed back for refinement.
Add structured reporting. The Reporting Agent produces a summary with: predictions table, uncertainty intervals, agent agreement metrics, flagged issues, and metacognitive scores. Output as markdown, JSON, or a visualization.
Test with a reference case. Validate the pipeline end-to-end on a known input (e.g., polystyrene, SMILES: CC(c1ccccc1)) where ground truth is available. Verify that consensus predictions fall within experimental ranges and that the metacognitive loop correctly identifies agent performance issues.

Concrete Examples

Example 1: High-throughput polymer property screening

User: "I have a CSV of 500 polymer SMILES strings. Predict glass transition temperature and density for each, with uncertainty estimates."

Approach:

Ingest the CSV, validate SMILES strings, convert each to a molecular graph and fingerprint vector.
Define two prediction agents: a GNN-based agent operating on molecular graphs and a descriptor-based agent operating on fingerprint features.
Dispatch predictions in parallel batches of 50.
For each polymer, compute consensus (mean) and uncertainty (std dev) across the two agents.
Run the Safety Agent to flag any SMILES that failed parsing or produced physically implausible values (e.g., negative density).
Output a results CSV with columns: smiles, tg_predicted_K, tg_uncertainty_K, density_predicted_gcc, density_uncertainty_gcc, flagged.

Output:

smiles,tg_predicted_K,tg_uncertainty_K,density_predicted_gcc,density_uncertainty_gcc,flagged
CC(c1ccccc1),378.2,12.7,1.021,0.027,false
C(=O)(O)CC(=O)O,285.4,8.3,1.312,0.015,false
...
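
A results file in this shape could be written with the standard `csv` module, as in the sketch below. The `write_results` helper is hypothetical, and the single row simply reuses the first line of the sample output above rather than any new prediction.

```python
import csv
import io

FIELDS = ["smiles", "tg_predicted_K", "tg_uncertainty_K",
          "density_predicted_gcc", "density_uncertainty_gcc", "flagged"]

def write_results(rows: list[dict]) -> str:
    """Serialize result rows to CSV text with lowercase booleans."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "flagged": str(row["flagged"]).lower()})
    return buffer.getvalue()

rows = [{
    "smiles": "CC(c1ccccc1)", "tg_predicted_K": 378.2, "tg_uncertainty_K": 12.7,
    "density_predicted_gcc": 1.021, "density_uncertainty_gcc": 0.027, "flagged": False,
}]
csv_text = write_results(rows)
print(csv_text)
```

Putting units in the column names (`_K`, `_gcc`) keeps the file self-describing for downstream consumers.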

Example 2: Metacognitive agent self-improvement

User: "My multi-agent pipeline has a research retrieval agent that keeps returning irrelevant papers. How do I add self-assessment?"

Approach:

After each retrieval task, score relevance by having a separate evaluation prompt rate each returned paper against the query (0-1 scale).
Store scores in a tactical reflection log: {"cycle": 12, "agent": "research", "effectiveness": 0.42, "task": "retrieve Tg data for polyamides"}.
After every N cycles, compute the strategic metric: compare research agent effectiveness against the population average of all agents.
If below average for 3+ consecutive cycles, trigger meta-strategic action: adjust the retrieval prompt template, switch search APIs, or add a re-ranking step.
Log the corrective action and track whether effectiveness improves in subsequent cycles.

Output (metacognitive dashboard):

{
  "cycle": 15,
  "agent_scores": {
    "research": {"effectiveness": 0.42, "trend": "declining", "action": "switching to semantic_scholar API + adding reranker"},
    "ml_model": {"effectiveness": 0.87, "trend": "stable", "action": "none"},
    "safety": {"effectiveness": 0.91, "trend": "stable", "action": "none"}
  },
  "pipeline_efficiency": 0.73,
  "population_average": 0.68
}

Example 3: Generative polymer design with multi-objective constraints

User: "Design biodegradable polymers with Tg between 70-90C and low production cost."

Approach:

The Planner Agent decomposes this into: (a) retrieve bio-based monomer library, (b) generate candidate structures via LLM, (c) predict Tg and cost-related properties for each candidate, (d) filter by constraints, (e) score and rank.
The Research Agent retrieves a list of bio-based monomers from literature databases.
A generative LLM agent proposes 50 candidate polymer structures as SMILES, drawing from the monomer library.
The ML Model agents predict Tg and density for each candidate via consensus.
The Safety Agent filters out candidates with known toxicity or regulatory issues.
Remaining candidates are scored on novelty (vs. existing polymers in database), feasibility (synthetic accessibility), and predicted property match.
Return top 10 ranked candidates with full property predictions and uncertainty.
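
The scoring and ranking step can be sketched as a weighted sum. Everything specific here is an assumption made for illustration: the equal weights, the linear fall-off of `property_match` outside the target window, and the candidate values are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    tg_c: float         # consensus Tg in degrees Celsius
    novelty: float      # 0 to 1
    feasibility: float  # 0 to 1

def property_match(tg_c: float, low: float = 70.0, high: float = 90.0) -> float:
    """1.0 inside the target window, falling linearly to 0 at 20 C outside it."""
    if low <= tg_c <= high:
        return 1.0
    distance = low - tg_c if tg_c < low else tg_c - high
    return max(0.0, 1.0 - distance / 20.0)

def score(c: Candidate, weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of novelty, feasibility, and property match."""
    w_novelty, w_feasibility, w_match = weights
    return (w_novelty * c.novelty + w_feasibility * c.feasibility
            + w_match * property_match(c.tg_c))

def rank(candidates: list[Candidate], top_k: int = 10) -> list[Candidate]:
    return sorted(candidates, key=score, reverse=True)[:top_k]

# Made-up candidates for demonstration.
pool = [
    Candidate("cand-C", tg_c=60.0, novelty=0.5, feasibility=0.5),
    Candidate("cand-A", tg_c=80.0, novelty=0.8, feasibility=0.9),
    Candidate("cand-B", tg_c=100.0, novelty=0.9, feasibility=0.9),
]
print([c.name for c in rank(pool)])  # ['cand-A', 'cand-B', 'cand-C']
```

A soft penalty rather than a hard cut-off keeps near-miss candidates visible in the ranking. If the constraints are strict, filter on the window first and score only the survivors.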

Output:

Rank | SMILES               | Tg (°C)    | Density (g/cm³) | Novelty | Feasibility | Score
1    | OC(=O)C(O)C(=O)O... | 78.3 ± 4.1 | 1.24 ± 0.02     | 0.85    | 0.92        | 0.88
2    | CC(O)C(=O)OC(C)...  | 82.1 ± 5.7 | 1.18 ± 0.03     | 0.79    | 0.88        | 0.84
...

Best Practices

Do: Use at least two independent prediction agents with different architectures or feature representations for consensus. A single model gives only a point estimate; consensus adds an inter-model spread that serves as a heuristic uncertainty estimate.
Do: Make the Planner Agent's task decomposition explicit and structured (JSON subtasks with typed inputs/outputs and dependency edges). This enables parallel dispatch and reproducible orchestration.
Do: Implement the metacognitive loop as a lightweight post-processing step, not inline with prediction. It should read logs, compute scores, and emit corrective actions without blocking the main pipeline.
Do: Track agent effectiveness scores over time in a persistent log. Trends matter more than individual scores for identifying systematic problems.
Avoid: Running all agents sequentially when subtasks are independent. The throughput gains from parallel dispatch are the primary advantage of this architecture.
Avoid: Hard-coding agent behavior. The metacognitive framework's value comes from dynamic adjustment. If you detect an underperforming agent but take no corrective action, the reflection layer is wasted computation.
Avoid: Treating consensus uncertainty as a proper Bayesian posterior. It is a heuristic measure of inter-model agreement. High agreement does not guarantee correctness if all models share the same systematic bias.

Error Handling

Agent failure or timeout: The Execution Agent should implement retry logic with exponential backoff. If an agent fails after retries, mark its prediction as missing and compute consensus from remaining agents. Log the failure in the tactical reflection.
Consensus divergence: If the standard deviation across agents exceeds a domain-specific threshold (e.g., > 30 K for Tg), flag the prediction as low-confidence rather than reporting a misleading mean. In the paper, RadonPy overestimated polystyrene Tg by ~80 °C relative to the ML agents, a discrepancy that this kind of threshold is designed to catch.
Invalid inputs: Validate all inputs at the Data Ingestion layer. Reject malformed SMILES, out-of-range values, or unsupported data types before they reach prediction agents. Return clear error messages with the failing input.
Metacognitive false positives: An agent may score low on one cycle due to a hard input, not a systemic problem. Require a trend (3+ consecutive below-average cycles) before triggering corrective actions.
Scaling bottlenecks: If parallel dispatch saturates resources, implement adaptive batching that adjusts batch size based on available compute, maintaining linear throughput scaling.
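
The first two failure modes above (agent failure and consensus divergence) can be handled together, as in the sketch below. The function names, retry count, and delays are illustrative assumptions, and the predictors return dummy values rather than real model output. The `sleep` parameter exists only so the demonstration can skip the real waits.

```python
import time
from statistics import mean, pstdev

def call_with_retry(fn, arg, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(arg); after a failure wait base_delay * 2**attempt, then retry.

    Returns None when every attempt fails, so the caller can mark it missing.
    """
    for attempt in range(retries):
        try:
            return fn(arg)
        except Exception:  # a real pipeline would catch narrower error types
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None

def robust_consensus(predictors, candidate, max_std, sleep=time.sleep):
    """Consensus over the agents that answered, with a divergence flag."""
    raw = {name: call_with_retry(fn, candidate, sleep=sleep)
           for name, fn in predictors.items()}
    values = [v for v in raw.values() if v is not None]
    missing = [name for name, v in raw.items() if v is None]
    if not values:
        return {"value": None, "uncertainty": None,
                "low_confidence": True, "missing": missing}
    # A spread needs at least two surviving predictions.
    spread = pstdev(values) if len(values) > 1 else None
    return {"value": mean(values), "uncertainty": spread,
            "low_confidence": spread is None or spread > max_std,
            "missing": missing}

def failing_simulation(smiles):
    raise TimeoutError("simulation agent timed out")

predictors = {
    "gnn": lambda smiles: 378.0,         # dummy values, not real model output
    "descriptor": lambda smiles: 372.0,
    "simulation": failing_simulation,
}
result = robust_consensus(predictors, "CC(c1ccccc1)", max_std=30.0, sleep=lambda s: None)
print(result)
```

The `missing` list is what the tactical reflection would log, so repeated failures of one agent surface in its effectiveness trend.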

Limitations

Consensus is not calibration. Multi-agent agreement measures inter-model variance, not true epistemic uncertainty. If all models are trained on the same biased data, consensus will be confidently wrong. Supplement with domain validation against known experimental values.
Metacognitive overhead. The three-layer reflection adds latency and complexity. For simple, well-characterized prediction tasks where models are already validated, skip meta-strategic reflection and use only tactical scoring.
Domain specificity of agents. Specialist agents (Characterization, Synthesis) encode domain knowledge specific to polymer science. Adapting this pattern to a new domain (e.g., catalysis, drug design) requires redefining agent roles, tools, and evaluation criteria from scratch.
LLM-generated candidates may be chemically invalid. The generative design loop depends on LLM-proposed structures. Without a validity checker (e.g., RDKit sanitization), a significant fraction of candidates may be nonsensical. Always filter generated structures before prediction.
The pattern works best with 3+ diverse prediction methods. With only two agents, the uncertainty estimate is crude: the standard deviation is simply proportional to the absolute difference between the two predictions. The consensus mechanism improves with more independent predictors.

Reference

Roy, M., Bazgir, A., Santos, A. d. S. S., & Zhang, Y. (2026). Autonomous Multi-Agent AI for High-Throughput Polymer Informatics: From Property Prediction to Generative Design Across Synthetic and Bio-Polymers. arXiv:2602.00103v1. https://arxiv.org/abs/2602.00103v1

Key sections to study: the eight-agent roster and Planner Agent decomposition pattern (Section 2), the consensus uncertainty mechanism (Section 3, Table 6), the metacognitive self-assessment framework with tactical/strategic/meta-strategic layers (Section 4, Tables 8-9), and the polystyrene end-to-end case study demonstrating all components working together (Section 5).