name: ce-bioinfo-qc
description: "Run sequencing and omics QA before downstream analysis. Use for FASTQ, BAM/CRAM, VCF, count matrices, methylation arrays, sample-swap checks, and batch-effect screening."
argument-hint: ", optional: --modality wgs|wes|rnaseq|chipseq|methyl|atac|microarray"
Bioinformatics Data QA Gate
Skill Value
- Problem it solves: Sequencing and omics projects can move into analysis before raw data quality, sample identity, and batch confounding are checked.
- Use when: The user mentions FASTQ, BAM/CRAM, VCF, RNA-seq, WGS/WES, ATAC-seq, methylation arrays, FastQC, MultiQC, sample swaps, or omics batch effects.
- Output: A GO/NO-GO omics QA summary with modality-specific checks and blockers for downstream analysis.
- Ask only if: Modality, sample sheet, batch variable, or case/control grouping cannot be inferred.
- Do not do: Do not run differential expression, association testing, or clinical interpretation.
- Interaction: Check repo/config/chat evidence first. Ask one decision-changing question at a time; use the current harness's blocking question UI when available, otherwise present numbered choices and wait.
The omics counterpart to /ce-data-qa. Sequencing and array data fail in domain-specific ways (low base quality, adapter contamination, sample swaps, batch confounded with condition); this skill runs the QC appropriate to the modality.
When this skill activates
- A new sequencing run / array batch was just registered as a data wave
- Before differential expression / variant calling / methylation analysis
- After re-sequencing that follows a sample-quality failure
- Before sharing data to a collaborator (sanity check)
- Manual:
/ce-bioinfo-qc /data/runs/run_2025_04 --modality rnaseq
Prerequisites
- A sample sheet (CSV/TSV) listing samples, files, and at minimum a `condition` and `batch` column
- Tools available locally OR via Conda: `fastqc`, `multiqc`, `samtools`, `picard`, `mosdepth`, `somalier` (sample swap), `picard CrosscheckFingerprints`
- For methylation: `minfi` / `sesame` (R)
- For RNA-seq: `salmon` or `STAR` outputs (or raw FASTQ)
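The sample-sheet prerequisite can be checked before any tool runs. The following is a minimal sketch, not the skill's actual implementation: the function name `missing_sample_sheet_columns` is hypothetical, and it assumes the header uses the literal column names `condition` and `batch` (compared case-insensitively) and that a tab in the header line means TSV.

```python
import csv
import io

# Column names taken from the prerequisite above; exact spelling is an assumption.
REQUIRED_COLUMNS = {"condition", "batch"}


def missing_sample_sheet_columns(text: str) -> set[str]:
    """Return the required columns that are absent from a CSV/TSV header."""
    first_line = text.splitlines()[0] if text.strip() else ""
    # Guess the delimiter from the header line: tab for TSV, comma otherwise.
    delimiter = "\t" if "\t" in first_line else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    present = {name.strip().lower() for name in header}
    return REQUIRED_COLUMNS - present
```

An empty result means both columns are present; anything else is a reason to stop and ask for a corrected sample sheet.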
Core workflow
Step 1: Modality detection
If --modality not passed, sniff the file extensions and tool outputs:
- `*.fastq.gz` only → fastq stage
- `*.bam` / `*.cram` → alignment stage
- counts matrix (`counts.tsv`, `salmon.sf`) → quantification stage
- `.idat` / `_Methylation_*.csv` → methylation array
- VCF → variant calling stage
Each modality gets its own set of checks. Refuse to proceed if the signals conflict (for example, FASTQ files mixed with a count matrix and no explicit --modality).
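The sniffing rules above could be sketched as a small pattern table. This is an illustration under assumptions: the function names and stage labels are hypothetical, the VCF suffixes and the leading wildcard on the methylation CSV pattern are guesses, and only the one conflict named in the text (FASTQ plus count matrix without an explicit modality) is treated as an error.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# (filename pattern, stage label). Stage labels are hypothetical names.
STAGE_PATTERNS = [
    ("*.fastq.gz", "fastq"),
    ("*.bam", "alignment"),
    ("*.cram", "alignment"),
    ("counts.tsv", "quantification"),
    ("salmon.sf", "quantification"),
    ("*.idat", "methylation_array"),
    ("*_Methylation_*.csv", "methylation_array"),  # leading wildcard assumed
    ("*.vcf", "variant_calling"),      # assumed suffix; the text only says "VCF"
    ("*.vcf.gz", "variant_calling"),   # assumed suffix
]


def detect_stages(paths):
    """Return the set of stages suggested by the file names."""
    stages = set()
    for path in paths:
        name = PurePath(path).name
        for pattern, stage in STAGE_PATTERNS:
            if fnmatchcase(name, pattern):
                stages.add(stage)
                break
    return stages


def check_stage_signals(paths, modality=None):
    """Return detected stages, refusing the FASTQ + counts mix without a modality."""
    stages = detect_stages(paths)
    if {"fastq", "quantification"} <= stages and modality is None:
        raise ValueError("FASTQ and count matrix found together; pass --modality")
    return stages
```

Keeping the rules in one table makes it easy to extend the list of conflicting combinations if other mixes turn out to be ambiguous in practice.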
Step 2: Run modality-specific checks
| Modality | Tool | Check |
|---|---|---|
| FASTQ (any) | FastQC | per-base quality, adapter content, GC distribution, kmer content |
| FASTQ (any) | MultiQC | aggregate report; sample-to-sample outliers |
| BAM (WGS/WES) | samtools flagstat | mapping rate, duplicate rate, paired-properly rate |
| BAM (WGS/WES) | mosdepth | coverage uniformity, mean depth, % on-target |
| BAM (RNA-seq) | RSeQC / Picard CollectRnaSeqMetrics | strand specificity, intronic/intergenic % |
| BAM (any) | somalier extract + relate | sample swap detection (genotype concordance to expected sex / pedigree) |
| BAM (any) | Picard CrosscheckFingerprints | sample swap via fingerprint VCF if available |
| Counts matrix | edgeR / DESeq2 PCA | sample clustering vs declared condition / batch |
| Counts matrix | RUVSeq / sva / ComBat-Seq | batch-effect candidate detection (DOES NOT remove; just flags) |
| Methylation IDAT | minfi qcReport | bisulfite conversion efficiency, detection p-value |
| Methylation array | sesame | sex-prediction discordance |
| VCF | bcftools stats | Ti/Tv ratio, het/hom ratio, novel variant fraction |
Step 3: Batch-effect screening (no removal)
Run PCA on the count or beta matrix. Color points by batch (date / lane / plate / center) and by condition (the experimental variable). If the first two PCs separate by batch AND condition is not orthogonal to batch (absolute Pearson correlation between the batch indicator and the condition indicator > 0.3), emit a P0 finding:
Batch is confounded with condition. Removing the batch effect (ComBat / RUV / SVA) will also remove condition signal. Re-block the experiment or add explicit batch correction in the model.
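The correlation half of this rule can be illustrated with plain Python. This sketch covers only the two-level case (two batches, two conditions) and does not perform the PCA step. The function names are hypothetical, and comparing the absolute value of r to the threshold is an assumption made here so that the arbitrary 0/1 coding of the labels does not change the outcome.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant indicator cannot be correlated with anything
    return cov / math.sqrt(var_x * var_y)


def batch_condition_confounded(batches, conditions, threshold=0.3):
    """Code each label as 0/1 against the first sample, then test |r| > threshold."""
    batch_ind = [1 if b == batches[0] else 0 for b in batches]
    cond_ind = [1 if c == conditions[0] else 0 for c in conditions]
    r = pearson(batch_ind, cond_ind)
    return abs(r) > threshold, r
```

For a design where every case sits in one batch and every control in the other, r is 1 and the check fires; for a fully balanced design r is 0. Designs with more than two batches or conditions would need a different association measure, which this sketch does not attempt.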
Step 4: Sample swap screening
For BAM data, run somalier relate against the expected pedigree (or against declared sample sex). For each sample where the predicted sex / kinship doesn't match the sample sheet → P0 finding sample-swap-suspected.
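Once predicted sex has been obtained for each sample, the comparison against the sample sheet is a simple lookup. The sketch below assumes both sides have already been loaded into dictionaries keyed by sample ID; it does not parse somalier output, and the function name and the shape of the finding records are hypothetical.

```python
def sex_mismatch_findings(declared, predicted):
    """Return one P0 finding per sample whose predicted sex differs from the sheet.

    declared:  {sample_id: sex from the sample sheet}
    predicted: {sample_id: sex inferred from the data}
    """
    findings = []
    for sample, declared_sex in sorted(declared.items()):
        predicted_sex = predicted.get(sample)
        if predicted_sex is None:
            continue  # no prediction available; this sketch leaves the sample unflagged
        if predicted_sex.strip().lower() != declared_sex.strip().lower():
            findings.append(
                {
                    "sample": sample,
                    "severity": "P0",
                    "finding": "sample-swap-suspected",
                    "declared": declared_sex,
                    "predicted": predicted_sex,
                }
            )
    return findings
```

Samples without a prediction are skipped here rather than flagged; a stricter gate might treat a missing prediction as a warning of its own.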
Step 5: Report
Write reports/bioinfo-qa/<wave_id>.html (MultiQC-rendered) and reports/bioinfo-qa/<wave_id>.md summary:
- Modality and N samples
- Per-sample pass/warn/fail table
- Batch confound assessment
- Sample swap assessment
- Aggregate metrics (mean coverage, mean Q score, % aligned, etc.)
- Sign-off block
Step 6: Emit GO/NO-GO
__CE_BIOINFO_QC_PASS__ or __CE_BIOINFO_QC_FAIL__ wave=<id> blockers=<count>
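A minimal sketch of how the marker line might be assembled follows. It rests on two assumptions the text does not confirm: that the gate passes exactly when the blocker count is zero, and that the `wave=` and `blockers=` fields are appended to both the pass and the fail marker. The function name `qc_marker` is hypothetical.

```python
def qc_marker(wave_id: str, blockers: int) -> str:
    """Build the GO/NO-GO line for a wave from its blocker count."""
    if blockers < 0:
        raise ValueError("blockers must be non-negative")
    # Assumption: zero blockers means GO, anything else means NO-GO.
    status = "__CE_BIOINFO_QC_PASS__" if blockers == 0 else "__CE_BIOINFO_QC_FAIL__"
    return f"{status} wave={wave_id} blockers={blockers}"
```

Emitting one fixed-format line keeps the result easy for a calling harness to match with a simple string search.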
What this skill does NOT do
- Does not modify the data (read-only)
- Does not remove batch effects (only flags; removal is a downstream choice)
- Does not call variants or quantify expression (use a pipeline reviewed by `ce-bioinfo-pipeline-reviewer`)
- Does not de-identify (genomic data has its own re-identification rules; consult IRB)
References
@./references/modality-checks.md
@./references/batch-confound-rules.md