geo-platform-streaming


Streaming download of large GEO GPL platform annotation files. Use when downloading GPL platforms with multi-GB SOFT files (e.g. GPL5175 = 4.3 GB, GPL5188 = 41.5 GB). Downloads only the platform table section (~14-36 MB) instead of the full file. Output: a Parquet file with probe_id and gene_symbol columns.

By svarogjk. Updated 2/21/2026.

GEO Platform Streaming Download

Key Insight

GEO SOFT files are structured text with a platform annotation table delimited by:

!platform_table_begin
<header row>
<data rows>
!platform_table_end
<rest of file — megabytes/gigabytes of sample data we don't need>

Close the HTTP stream immediately after !platform_table_end to avoid downloading gigabytes of irrelevant sample data.
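
The early-stop idea can be tried offline before touching the network. The sketch below is only an illustration under stated assumptions: the payload is synthetic (a tiny table followed by random filler standing in for sample data), and `compressed_bytes_until_marker` is a hypothetical helper, not part of the skill's script.

```python
import gzip
import random
import zlib

MARKER = b"!platform_table_end"

def compressed_bytes_until_marker(gz_bytes: bytes, chunk_size: int = 256) -> int:
    """Feed compressed chunks until the end marker shows up; return bytes consumed."""
    inflater = zlib.decompressobj(wbits=47)  # 32 + 15: auto-detect gzip/zlib header
    seen = b""
    consumed = 0
    for start in range(0, len(gz_bytes), chunk_size):
        chunk = gz_bytes[start:start + chunk_size]
        consumed += len(chunk)
        seen += inflater.decompress(chunk)
        if MARKER in seen:
            break  # a real client would close the HTTP response at this point
    return consumed

# Synthetic stand-in for a SOFT file: a tiny table, then bulky "sample data".
table = b"!platform_table_begin\nID\tGene Symbol\n1\tTP53\n!platform_table_end\n"
rng = random.Random(0)
filler = bytes(rng.getrandbits(8) for _ in range(200_000))  # incompressible noise
payload = gzip.compress(table + filler)

used = compressed_bytes_until_marker(payload)
print(f"read {used} of {len(payload)} compressed bytes")
```

Only the compressed bytes up to the marker are consumed; everything after it is never read, which is where the savings on multi-GB files come from.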

Real-world impact

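| Platform | Full file size | Bytes actually downloaded | Rows extracted | Time |
|----------|----------------|---------------------------|----------------|------|
| GPL5175  | 4,319 MB       | 14 MB                     | 78,907         | 9 s  |
| GPL5188  | 41,553 MB      | 36.5 MB                   | 620,000        | 13 s |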

Output Format

Parquet file at platform_mappings/GPL{id}_gene_mapping.parquet:

probe_id       object
gene_symbol    object

Identical schema to all other pre-built platform parquets used by gene_mapping_service.py.

URL Pattern

def _series_dir(platform_id: str) -> str:
    # GEO groups platforms by thousands: GPL5175 -> GPL5nnn.
    # Platforms below 1000 (e.g. GPL570) live under plain "GPLnnn", not "GPL0nnn".
    num = int(platform_id)
    return f"GPL{num // 1000}nnn" if num >= 1000 else "GPLnnn"

def _soft_url(platform_id: str) -> str:
    # full family SOFT file: can be multi-GB
    return f"https://ftp.ncbi.nlm.nih.gov/geo/platforms/{_series_dir(platform_id)}/GPL{platform_id}/soft/GPL{platform_id}_family.soft.gz"

def _annot_url(platform_id: str) -> str:
    # annot files are smaller: try first, fall back to soft
    return f"https://ftp.ncbi.nlm.nih.gov/geo/platforms/{_series_dir(platform_id)}/GPL{platform_id}/annot/GPL{platform_id}.annot.gz"

Always try the .annot.gz URL first (much smaller). Fall back to _family.soft.gz on 404.
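
To make the try-order concrete, here is a small sketch that returns both candidate URLs in the order they should be attempted. `platform_urls` is a hypothetical name, and the handling of platform numbers below 1000 (a plain `GPLnnn` directory) is an assumption based on how the GEO FTP tree appears to be laid out.

```python
def platform_urls(platform_id: str) -> list[str]:
    """Return candidate download URLs in the order they should be tried."""
    num = int(platform_id)
    # Series directory: GPL5175 -> GPL5nnn; assumed GPLnnn for IDs below 1000.
    series = f"GPL{num // 1000}nnn" if num >= 1000 else "GPLnnn"
    base = f"https://ftp.ncbi.nlm.nih.gov/geo/platforms/{series}/GPL{platform_id}"
    return [
        f"{base}/annot/GPL{platform_id}.annot.gz",       # small, try first
        f"{base}/soft/GPL{platform_id}_family.soft.gz",  # large fallback
    ]

for url in platform_urls("5175"):
    print(url)
```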

Incremental gzip Decompressor

SOFT files are gzip-compressed. Use zlib.decompressobj(wbits=47) to decompress streaming chunks without buffering the full file:

import zlib

class StreamingDecompressor:
    def __init__(self):
        # wbits=47 → auto-detect zlib/gzip
        self._d = zlib.decompressobj(wbits=47)

    def decompress(self, data: bytes) -> bytes:
        return self._d.decompress(data)
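
If the server sends an uncompressed body, `zlib` raises an error on the first chunk. One possible way to handle that case, sketched here as an assumption rather than the script's actual behavior, is to check the two gzip magic bytes and pass plain bytes through unchanged. `TolerantDecompressor` is a hypothetical name. Note that this variant treats anything without the gzip magic as plain text, so it gives up the zlib-header auto-detection.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"

class TolerantDecompressor:
    """Inflate gzip chunks, or pass bytes through if the stream is not gzip."""

    def __init__(self) -> None:
        self._inflater = None
        self._is_gzip = None  # unknown until the first non-empty chunk arrives

    def decompress(self, data: bytes) -> bytes:
        if not data:
            return b""
        if self._is_gzip is None:
            # Assumes the first chunk holds at least the 2-byte gzip magic.
            self._is_gzip = data[:2] == GZIP_MAGIC
            if self._is_gzip:
                self._inflater = zlib.decompressobj(wbits=47)
        return self._inflater.decompress(data) if self._is_gzip else data

payload = b"!platform_table_begin\nID\n!platform_table_end\n"
plain = TolerantDecompressor()
print(plain.decompress(payload) == payload)

gz = gzip.compress(payload)
zipped = TolerantDecompressor()
print(zipped.decompress(gz[:10]) + zipped.decompress(gz[10:]) == payload)
```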

Full Streaming Parser

import re
import zlib
from pathlib import Path
from typing import Optional

import httpx
import pandas as pd

CACHE_DIR = Path("platform_mappings")
ID_NAMES = {"id", "probe_id", "probeset_id", "probe", "id_ref"}
GENE_KEYWORDS = ["gene", "symbol", "mrna", "assignment", "annotation"]


def stream_parse_platform(platform_id: str) -> Optional[pd.DataFrame]:
    """Stream-parse GPL SOFT file: only download the platform table section."""
    urls = [_annot_url(platform_id), _soft_url(platform_id)]

    for url in urls:
        try:
            with httpx.stream("GET", url, timeout=None, follow_redirects=True) as resp:
                if resp.status_code == 404:
                    continue
                resp.raise_for_status()
                result = _parse_stream(resp.iter_bytes(chunk_size=512 * 1024))
                if result is not None:
                    return result
        except httpx.HTTPStatusError:
            continue
    return None


def _parse_stream(byte_iter) -> Optional[pd.DataFrame]:
    decompressor = zlib.decompressobj(wbits=47)
    tail = b""
    in_table = False
    header_parsed = False
    id_col: Optional[int] = None
    gene_cols: list[int] = []
    mappings: dict[str, str] = {}

    for raw_chunk in byte_iter:
        text_data = decompressor.decompress(raw_chunk)
        combined = tail + text_data
        lines = combined.split(b"\n")
        tail = lines[-1]

        for raw_line in lines[:-1]:
            line = raw_line.decode("utf-8", errors="replace").rstrip()

            if line.startswith("!platform_table_begin"):
                in_table = True
                continue

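            if line.startswith("!platform_table_end"):
                # Returning here exits the `with httpx.stream(...)` block in the
                # caller, which closes the connection and stops the download.
                # Return None for an empty table so the caller tries the next URL.
                if not mappings:
                    return None
                return pd.DataFrame(list(mappings.items()), columns=["probe_id", "gene_symbol"])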

            if not in_table or line.startswith("!"):
                continue

            if not header_parsed:
                columns = [c.strip().strip('"') for c in line.split("\t")]
                id_col = next((i for i, c in enumerate(columns)
                               if c.strip("#").strip().lower() in ID_NAMES), None)
                gene_cols = [i for i, c in enumerate(columns)
                             if any(kw in c.lower() for kw in GENE_KEYWORDS)]
                if id_col is None or not gene_cols:
                    return None
                header_parsed = True
                continue

            fields = line.split("\t")
            if len(fields) > id_col:
                probe_id = fields[id_col].strip()
                if probe_id:
                    gene = _extract_gene_from_fields(fields, gene_cols)
                    if gene:
                        mappings[probe_id] = gene

    return pd.DataFrame(list(mappings.items()), columns=["probe_id", "gene_symbol"]) if mappings else None


def _extract_gene_from_fields(fields: list[str], gene_col_indices: list[int]) -> Optional[str]:
    for idx in gene_col_indices:
        if idx < len(fields):
            gene = _extract_gene(fields[idx])
            if gene:
                return gene
    return None


def _extract_gene(text: str) -> Optional[str]:
    if not text or text.strip() in {"---", "NA", "NULL", "", "null", "N/A"}:
        return None
    text = text.strip()
    # Symbol in parentheses: "description (SYMBOL)"
    m = re.search(r"\(([A-Za-z0-9_\-]+)\)", text)
    if m and re.match(r"^[A-Z][A-Za-z0-9_\-]*$", m.group(1)):
        return m.group(1)
    # First token before delimiters
    text = re.split(r"\s*///\s*|[;,|/]", text)[0].strip()
    text = re.sub(r"\s*\[.*?\]", "", text).strip().strip('"').strip("'")
    if text and re.match(r"^[A-Za-z0-9][A-Za-z0-9_\-\.]*$", text):
        return text
    return None

CLI Script

The ready-to-run script lives at backend/download_large_platforms.py:

# Download specific platform IDs
uv run python download_large_platforms.py 5175 5188

# Default targets (GPL5175, GPL5188) if no args given
uv run python download_large_platforms.py

Skips platforms that already have a parquet. Saves to platform_mappings/GPL{id}_gene_mapping.parquet.
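
The skip-if-cached behavior described above might look roughly like the sketch below. This is not the contents of download_large_platforms.py, which is not shown here; `pending_platforms` and `DEFAULT_TARGETS` are hypothetical names, and only the defaults and the output path pattern come from the text.

```python
from pathlib import Path

DEFAULT_TARGETS = ["5175", "5188"]  # defaults named in the text above

def pending_platforms(args: list[str], cache_dir: Path) -> list[str]:
    """Return platform IDs that still need a download, skipping cached parquets."""
    targets = args or DEFAULT_TARGETS
    return [
        pid for pid in targets
        if not (cache_dir / f"GPL{pid}_gene_mapping.parquet").exists()
    ]

# With no arguments, the defaults are checked against the cache directory.
print(pending_platforms([], Path("platform_mappings")))
```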

When to Use This

  • When gene_mapping_service.py logs a platform with size > 500 MB in the pre-download check
  • When download_platforms.py would take hours or OOM for a platform
  • Specifically confirmed necessary for: GPL5175, GPL5188, GPL16686, GPL23159, GPL4133
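
A size pre-check of the kind mentioned in the first bullet could be approximated with a HEAD request. The sketch below is an assumption about how such a check might work, not the logic in gene_mapping_service.py. `needs_streaming` and `remote_needs_streaming` are hypothetical names, `Content-Length` reports the compressed size, and treating an unknown size as "large" is a design choice made here for safety.

```python
from typing import Optional

THRESHOLD_MB = 500  # threshold mentioned in the list above

def needs_streaming(content_length: Optional[str], threshold_mb: int = THRESHOLD_MB) -> bool:
    """Decide from a Content-Length header value whether to stream-parse."""
    if content_length is None or not content_length.isdigit():
        return True  # unknown size: assume large and stream to be safe
    return int(content_length) > threshold_mb * 1024 * 1024

def remote_needs_streaming(url: str) -> bool:
    """Issue a HEAD request and apply the size check (requires network access)."""
    import httpx  # imported lazily so the pure check above has no dependencies
    resp = httpx.head(url, follow_redirects=True, timeout=30.0)
    return needs_streaming(resp.headers.get("Content-Length"))

print(needs_streaming(str(600 * 1024 * 1024)))
print(needs_streaming("1024"))
```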

Common Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| zlib.error in decompressor (OSError if the gzip module is used instead) | Server returned non-gzip | Treat bytes as plain text |
| No !platform_table_end found | Some files lack the marker | Use collected mappings anyway |
| Empty gene columns | Unusual column names | Inspect first few columns, add keyword |
| 404 on annot URL | No annot file for this platform | Falls back to soft automatically |
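
For the "Empty gene columns" case, the header-matching step can be run on its own to see which columns the parser would pick. The sketch below reuses the ID_NAMES and GENE_KEYWORDS values from the parser; `detect_columns` is a hypothetical wrapper around the same matching logic, and the header strings are made-up examples.

```python
from typing import Optional

ID_NAMES = {"id", "probe_id", "probeset_id", "probe", "id_ref"}
GENE_KEYWORDS = ["gene", "symbol", "mrna", "assignment", "annotation"]

def detect_columns(header_line: str) -> tuple[Optional[int], list[int]]:
    """Return (index of the ID column, indices of candidate gene columns)."""
    columns = [c.strip().strip('"') for c in header_line.split("\t")]
    id_col = next(
        (i for i, c in enumerate(columns) if c.strip("#").strip().lower() in ID_NAMES),
        None,
    )
    gene_cols = [
        i for i, c in enumerate(columns)
        if any(kw in c.lower() for kw in GENE_KEYWORDS)
    ]
    return id_col, gene_cols

print(detect_columns("ID\tGB_LIST\tSPOT_ID\tgene_assignment\tmrna_assignment"))  # (0, [3, 4])
print(detect_columns("ProbeName\tSystematicName"))                               # (None, [])
```

A result of `(None, [])` means the parser would return None for that file; adding the missing name to ID_NAMES or a keyword to GENE_KEYWORDS is the fix suggested in the table.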

Install via CLI

npx skills add https://github.com/svarogjk/caloric_restriction --skill geo-platform-streaming