genomics-assembly

name: genomics-assembly description: Load when computing genome-assembly QC metrics — N50/N90, L50/L90, total length, contig count, GC content, longest-contig — from a FASTA produced by any assembler (SPAdes / Megahit / Flye / Canu). Skip when running the assembly itself or when assessing alignment quality (use genomics-alignment). version: 0.5.0 author: OmicsClaw license: MIT tags:

genomics
assembly
n50
l50
contig
quast
spades
flye requires:
pandas
numpy

When to use

The user has a FASTA from any de novo assembler (SPAdes, Megahit, Flye, Canu, etc.) and wants standard QUAST-compatible quality metrics: contig count, N50 / N90, L50 / L90, total length, longest contig, GC content, optional completeness fraction (when --genome-size is provided).

This skill does NOT run the assembly. It consumes the FASTA the assembler emits.

Inputs & Outputs

Input	Format	Required
Assembled contigs	`.fasta` / `.fa`	yes (unless `--demo`)
Expected genome size	`--genome-size <bp>` (used for completeness %)	optional

Output	Path	Notes
Per-contig table	`tables/contig_lengths.csv`	one row per contig with `contig` + `length` (no GC column — GC is assembly-wide only)
Assembly metrics	`tables/assembly_metrics.csv`	one-row N50/L50/total/longest summary
Report	`report.md` + `result.json`	always

Flow

Load FASTA (--input <assembly.fasta>) or generate a demo assembly at output_dir/demo_assembly.fasta (genome_assembly.py:186).
Parse contigs (genome_assembly.py:52-67); each line is uppercased on read so case is normalised.
Sort by length; compute cumulative N50 / N90 / L50 / L90 + total / longest + assembly-wide GC% from concatenated sequence.
If --genome-size is set and > 0, compute completeness_pct = total_length / genome_size * 100.
Write tables/contig_lengths.csv (genome_assembly.py:309) + tables/assembly_metrics.csv (:312) + report.md + result.json.

Gotchas

No assembler is invoked. This skill summarises an existing FASTA — it does not run SPAdes / Megahit / Flye / Canu. Run them upstream and feed the resulting FASTA here.
--input REQUIRED unless --demo. genome_assembly.py:290 raises ValueError("--input required when not using --demo"); non-existent paths raise FileNotFoundError at :293.
--genome-size 0 (default) skips completeness. Without an expected genome size (genome_assembly.py:279, default 0), the report omits the completeness column entirely. Pass --genome-size 3000000000 for a human-scale assembly to populate it.
Soft-masked bases are normalised to uppercase before GC counting. genome_assembly.py:67 calls line.upper() per FASTA line, so lowercase soft-masked regions contribute identically to hard-masked / unmasked sequence in the GC%. There is no way to exclude soft-masked regions short of pre-filtering the FASTA.
Demo FASTA has 100 contigs of varying length. --demo writes a fixed-pattern synthetic file useful for orchestrator smoke tests; the N50 it produces is not biologically meaningful.

Key CLI

# Demo
python omicsclaw.py run genomics-assembly --demo --output /tmp/asm_demo

# Real assembly with completeness against expected size
python omicsclaw.py run genomics-assembly \
  --input my_assembly.fasta --output results/ \
  --genome-size 3100000000

genomics-assembly