bio-genome-intervals-gtf-gff-handling

star 30

Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-genome-intervals-gtf-gff-handling description: Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats. tool_type: mixed primary_tool: gffread measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

GTF/GFF Handling

GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.

Format Comparison

Feature GTF GFF3
Coordinate system 1-based, inclusive 1-based, inclusive
Hierarchy Implicit (gene_id, transcript_id) Explicit (Parent attribute)
Attribute format key "value"; key=value;
Comments # #
Fasta sequences Not standard ##FASTA directive

GTF Format

chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";

GFF3 Format

chr1	HAVANA	gene	11869	14409	.	+	.	ID=ENSG00000223972;Name=DDX11L1
chr1	HAVANA	mRNA	11869	14409	.	+	.	ID=ENST00000456328;Parent=ENSG00000223972
chr1	HAVANA	exon	11869	12227	.	+	.	ID=exon1;Parent=ENST00000456328

Parse GTF with gtfparse (Python)

Installation

pip install gtfparse

Basic Parsing

import gtfparse

# Load entire GTF
df = gtfparse.read_gtf('annotation.gtf')

# View columns
print(df.columns)
# ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',
#  'gene_id', 'transcript_id', 'gene_name', ...]

# Filter by feature type
genes = df[df['feature'] == 'gene']
transcripts = df[df['feature'] == 'transcript']
exons = df[df['feature'] == 'exon']

# Get specific gene
gene_df = df[df['gene_name'] == 'TP53']

Extract Gene Coordinates

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

# All genes
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]

# Convert to BED format (0-based)
genes_bed = genes.copy()
genes_bed['start'] = genes_bed['start'] - 1  # GTF is 1-based, BED is 0-based
genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']]
genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)

Get Exons for Gene

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

# Get all exons for TP53
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')]
tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']]
print(tp53_exons)

Parse GFF with gffutils (Python)

Installation

pip install gffutils

Create Database

import gffutils

# Create database (slow first time, fast for subsequent queries)
db = gffutils.create_db('annotation.gff3', 'annotation.db',
                         force=True, merge_strategy='create_unique')

# Or load existing database
db = gffutils.FeatureDB('annotation.db')

Query Features

import gffutils

db = gffutils.FeatureDB('annotation.db')

# Count features by type
for featuretype in db.featuretypes():
    count = db.count_features_of_type(featuretype)
    print(f'{featuretype}: {count}')

# Get all genes
for gene in db.features_of_type('gene'):
    print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')

# Get gene by ID
gene = db['ENSG00000141510']  # TP53
print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')

# Get children (transcripts, exons)
for transcript in db.children(gene, featuretype='mRNA'):
    print(f'  Transcript: {transcript.id}')
    for exon in db.children(transcript, featuretype='exon'):
        print(f'    Exon: {exon.start}-{exon.end}')

Get Introns

import gffutils

db = gffutils.FeatureDB('annotation.db')

# Get introns for a transcript
transcript = db['ENST00000269305']
introns = list(db.interfeatures(db.children(transcript, featuretype='exon'),
                                 new_featuretype='intron'))
for intron in introns:
    print(f'Intron: {intron.start}-{intron.end}')

Convert Formats with gffread (CLI)

Installation

conda install -c bioconda gffread

GTF to GFF3

gffread annotation.gtf -o annotation.gff3

GFF3 to GTF

gffread annotation.gff3 -T -o annotation.gtf

Extract Sequences

# Extract transcript sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf

# Extract CDS sequences
gffread -x cds.fa -g genome.fa annotation.gtf

# Extract protein sequences
gffread -y proteins.fa -g genome.fa annotation.gtf

Filter Features

# Keep only protein-coding genes
gffread annotation.gtf -C -o coding.gtf

# Keep specific gene types
gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf

Extract Regions with bedtools

Get Promoters

# Extract TSS (transcript start sites)
awk '$3 == "transcript"' annotation.gtf | \
    awk -v OFS='\t' '{
        if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
        else print $1, $5-1, $5, ".", ".", $7;
    }' > tss.bed

# Get promoter regions (2kb upstream of TSS)
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed

Get Gene Bodies

# Extract gene coordinates to BED
awk '$3 == "gene"' annotation.gtf | \
    awk -v OFS='\t' '{
        split($0, a, "gene_id \""); split(a[2], b, "\"");
        print $1, $4-1, $5, b[1], ".", $7;
    }' > genes.bed

Get Exons

# Extract unique exons
awk '$3 == "exon"' annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \
    sort -k1,1 -k2,2n | uniq > exons.bed

Python: GTF to BED Conversion

import gtfparse
import pandas as pd

def gtf_to_bed(gtf_path, feature_type='gene', output_path=None):
    '''Convert GTF features to BED format.'''
    df = gtfparse.read_gtf(gtf_path)
    features = df[df['feature'] == feature_type].copy()

    # Convert to 0-based coordinates
    bed = pd.DataFrame({
        'chrom': features['seqname'],
        'start': features['start'] - 1,
        'end': features['end'],
        'name': features.get('gene_name', features.get('gene_id', '.')),
        'score': 0,
        'strand': features['strand']
    })

    if output_path:
        bed.to_csv(output_path, sep='\t', header=False, index=False)
    return bed

# Usage
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed')
exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')

Validate GTF/GFF

# Check GTF format
gffread -E annotation.gtf

# Check GFF3 format
gffread -E annotation.gff3

# Detailed validation
gt gff3validator annotation.gff3  # requires genometools

Common Attributes

GTF Attributes

Attribute Description
gene_id Ensembl gene ID
gene_name Gene symbol
gene_biotype protein_coding, lncRNA, etc.
transcript_id Ensembl transcript ID
transcript_name Transcript symbol
exon_number Exon position in transcript
exon_id Ensembl exon ID

GFF3 Attributes

Attribute Description
ID Unique feature identifier
Name Display name
Parent Parent feature ID
Dbxref Database cross-references
gene_biotype Gene type

Memory-Efficient Processing

import gtfparse

# Process large files in chunks (gtfparse loads all into memory)
# For very large files, use gffutils database approach

# Or filter during parsing
df = gtfparse.read_gtf('annotation.gtf',
                        features=['gene', 'exon'])  # Only load specific features

Related Skills

  • bed-file-basics - BED format and conversion
  • interval-arithmetic - Gene/exon overlap analysis
  • proximity-operations - TSS proximity analysis
  • differential-expression/de-results - Gene coordinate mapping
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-genome-intervals-gtf-gff-handling
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →