name: genomics-assembly
description: Load when computing genome-assembly QC metrics — N50/N90, L50/L90, total length, contig count, GC content, longest-contig — from a FASTA produced by any assembler (SPAdes / Megahit / Flye / Canu). Skip when running the assembly itself or when assessing alignment quality (use genomics-alignment).
version: 0.5.0
author: OmicsClaw
license: MIT
tags:
- genomics
- assembly
- n50
- l50
- contig
- quast
- spades
- flye
requires:
- pandas
- numpy
genomics-assembly
When to use
The user has a FASTA from any de novo assembler (SPAdes, Megahit,
Flye, Canu, etc.) and wants standard QUAST-compatible quality
metrics: contig count, N50 / N90, L50 / L90, total length, longest
contig, GC content, optional completeness fraction (when
--genome-size is provided).
This skill does NOT run the assembly. It consumes the FASTA the assembler emits.
Inputs & Outputs
| Input | Format | Required |
|---|---|---|
| Assembled contigs | .fasta / .fa | yes (unless --demo) |
| Expected genome size | --genome-size <bp> (used for completeness %) | optional |

| Output | Path | Notes |
|---|---|---|
| Per-contig table | tables/contig_lengths.csv | one row per contig with contig + length (no GC column; GC is assembly-wide only) |
| Assembly metrics | tables/assembly_metrics.csv | one-row N50/L50/total/longest summary |
| Report | report.md + result.json | always |
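
As a quick cross-check of the per-contig table, the sketch below recomputes contig count, total length, and longest contig from a CSV with `contig` and `length` columns. It assumes the header names are exactly those two words; the helper name `summarise_contig_table` and the inline example rows are hypothetical and are not produced by the skill.

```python
import csv
import io


def summarise_contig_table(handle) -> dict:
    """Recompute simple totals from a contig/length CSV (assumed header: contig,length)."""
    lengths = [int(row["length"]) for row in csv.DictReader(handle)]
    return {
        "contig_count": len(lengths),
        "total_length": sum(lengths),
        "longest_contig": max(lengths, default=0),
    }


# Hypothetical three-row table standing in for tables/contig_lengths.csv.
example = io.StringIO("contig,length\ncontig_1,5000\ncontig_2,3000\ncontig_3,2000\n")
table_summary = summarise_contig_table(example)
```

With a real run, the same function can be pointed at an open file handle for tables/contig_lengths.csv and the result compared against the values in tables/assembly_metrics.csv.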
Flow
- Load FASTA (--input <assembly.fasta>) or generate a demo assembly at output_dir/demo_assembly.fasta (genome_assembly.py:186).
- Parse contigs (genome_assembly.py:52-67); each line is uppercased on read so case is normalised.
- Sort by length; compute cumulative N50 / N90 / L50 / L90 + total / longest + assembly-wide GC% from concatenated sequence.
- If --genome-size is set and > 0, compute completeness_pct = total_length / genome_size * 100.
- Write tables/contig_lengths.csv (genome_assembly.py:309) + tables/assembly_metrics.csv (:312) + report.md + result.json.
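
The metric step in the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in genome_assembly.py: the function names `nx_lx` and `summarise`, the dictionary keys, and the use of total length (including any N bases) as the GC denominator are all choices made for the example.

```python
from typing import Dict, List, Tuple


def nx_lx(lengths: List[int], fraction: float) -> Tuple[int, int]:
    """Return (Nx, Lx): the contig length and the number of contigs at which
    the cumulative sum of length-sorted contigs first reaches `fraction` of the total."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    return 0, 0  # empty assembly


def summarise(contigs: Dict[str, str], genome_size: int = 0) -> Dict[str, float]:
    """Compute assembly-wide metrics from a {name: sequence} mapping."""
    seqs = [seq.upper() for seq in contigs.values()]  # normalise case, as the skill does
    lengths = [len(seq) for seq in seqs]
    total = sum(lengths)
    gc = sum(seq.count("G") + seq.count("C") for seq in seqs)
    n50, l50 = nx_lx(lengths, 0.5)
    n90, l90 = nx_lx(lengths, 0.9)
    metrics = {
        "contig_count": len(lengths),
        "total_length": total,
        "longest_contig": max(lengths, default=0),
        "n50": n50,
        "l50": l50,
        "n90": n90,
        "l90": l90,
        # Assumption: GC% is taken over every base, Ns included.
        "gc_pct": 100.0 * gc / total if total else 0.0,
    }
    if genome_size > 0:  # completeness is only reported for a positive genome size
        metrics["completeness_pct"] = total / genome_size * 100
    return metrics


# Toy assembly: contigs of 50, 30 and 20 bp (the last one soft-masked).
demo = {"c1": "GC" * 25, "c2": "AT" * 15, "c3": "acgt" * 5}
metrics = summarise(demo, genome_size=200)
```

For the three toy contigs, N50 is 50 with an L50 of 1, and N90 is 20 with an L90 of 3; calling `summarise(demo)` without a genome size leaves `completeness_pct` out of the result.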
Gotchas
- No assembler is invoked. This skill summarises an existing FASTA — it does not run SPAdes / Megahit / Flye / Canu. Run them upstream and feed the resulting FASTA here.
- --input REQUIRED unless --demo. genome_assembly.py:290 raises ValueError("--input required when not using --demo"); non-existent paths raise FileNotFoundError at :293.
- --genome-size 0 (default) skips completeness. Without an expected genome size (genome_assembly.py:279, default 0), the report omits the completeness column entirely. Pass --genome-size 3000000000 for a human-scale assembly to populate it.
- Soft-masked bases are normalised to uppercase before GC counting. genome_assembly.py:67 calls line.upper() per FASTA line, so lowercase soft-masked regions contribute to the GC% exactly as unmasked sequence does. There is no way to exclude soft-masked regions short of pre-filtering the FASTA.
- Demo FASTA has 100 contigs of varying length. --demo writes a fixed-pattern synthetic file useful for orchestrator smoke tests; the N50 it produces is not biologically meaningful.
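
If soft-masked regions need to be excluded from GC%, one possible pre-filter is sketched below. It is not part of the skill: `strip_soft_masked` and `gc_pct` are hypothetical helpers, and dropping lowercase bases also shortens the contigs, so N50 and total length computed on the filtered file will differ from those of the original assembly.

```python
def strip_soft_masked(fasta_text: str) -> str:
    """Return FASTA text with lowercase (soft-masked) bases removed from sequence lines."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header lines are kept untouched
        else:
            out.append("".join(base for base in line if not base.islower()))
    return "\n".join(out) + "\n"


def gc_pct(fasta_text: str) -> float:
    """Assembly-wide GC% after uppercasing every sequence line."""
    seq = "".join(
        line.strip().upper()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


raw = ">c1\nGGCCatat\nATGC\n"
filtered = strip_soft_masked(raw)
gc_with_masked = gc_pct(raw)          # lowercase bases counted after uppercasing
gc_without_masked = gc_pct(filtered)  # lowercase bases removed beforehand
```

In this toy record the soft-masked `atat` stretch pulls GC% from 75.0 down to 50.0, which shows how much a large masked repeat fraction can shift the assembly-wide figure.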
Key CLI
# Demo
python omicsclaw.py run genomics-assembly --demo --output /tmp/asm_demo
# Real assembly with completeness against expected size
python omicsclaw.py run genomics-assembly \
--input my_assembly.fasta --output results/ \
--genome-size 3100000000
See also
- references/parameters.md: every CLI flag
- references/methodology.md: N50 / L50 definitions, GC interpretation
- references/output_contract.md: tables/assembly_metrics.csv schema
- Adjacent skills: genomics-alignment (downstream: map reads back to your assembly to validate), genomics-qc (upstream: FASTQ QC before assembly), genomics-cnv-calling (parallel: copy-number on a known reference instead of de novo)