name: pvl-data-pipeline
description: PVL prediction data pipeline. Use when working with peptide data flow, TANGO output, S4PRED predictions, FF-Helix calculations, normalization, or debugging single-vs-batch consistency issues.
user-invocable: false
PVL Data Pipeline
Pipeline Flow (Both Single & Batch)
Input → DataFrame → FF-Helix → TANGO → S4PRED → Biochem → FF Flags → Normalize → API Response
Step-by-step:
- Create DataFrame with Entry, Sequence, Length columns
- ensure_ff_cols(df): auxiliary.ff_helix_percent() + ff_helix_cores() per row
- ensure_computed_cols(df): ensure all computed columns exist
- TANGO (if enabled):
  - tango.run_tango_simple(records): runs binary
  - tango.process_tango_output(df, run_dir): parses output, adds SSW columns
  - tango.filter_by_avg_diff(df, mode, stats): computes SSW prediction flags
- S4PRED (if enabled):
  - s4pred.run_s4pred_database(df, mode, trace_id): runs PyTorch model
  - s4pred.filter_by_s4pred_diff(df): computes S4PRED SSW predictions
- calc_biochem(df): Charge, Hydrophobicity, μH
- resolve_thresholds() + apply_ff_flags(df, thresholds, mode): FF-SSW and FF-Helix flags
- _finalize_ui_aliases(df) + finalize_ff_fields(df): clamp FF %, convert -1→None
- normalize_rows_for_ui(df): DataFrame → camelCase API response dicts
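The ordering and the "if enabled" gating in the steps above can be sketched as a small step runner. This is only an illustration of the shape, not the service code: `run_steps`, `add_length`, `fake_tango`, and the `enabled` dict are hypothetical names, and plain dicts stand in for the DataFrame.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
# (step name, function, provider that must be enabled or None if always run)
Step = Tuple[str, Callable[[List[Row]], List[Row]], Optional[str]]


def run_steps(rows: List[Row], steps: List[Step], enabled: Dict[str, bool]):
    """Apply steps in order, skipping any step whose provider is disabled."""
    executed: List[str] = []
    for name, func, provider in steps:
        if provider is not None and not enabled.get(provider, False):
            continue  # provider OFF: the step never touches the rows
        rows = func(rows)
        executed.append(name)
    return rows, executed


def add_length(rows: List[Row]) -> List[Row]:
    # Stand-in for building the Entry / Sequence / Length columns.
    return [{**r, "Length": len(str(r["Sequence"]))} for r in rows]


def fake_tango(rows: List[Row]) -> List[Row]:
    # Placeholder for a provider step; it does not run any real binary.
    return [{**r, "SSW prediction": -1} for r in rows]


STEPS: List[Step] = [
    ("length", add_length, None),
    ("tango", fake_tango, "tango"),
]
```

Passing the same `STEPS` list to both the single and the batch entry point is what keeps the two flows on one code path.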
Entry Points
| Flow | Route | Service |
|---|---|---|
| Single | api/routes/predict.py:14 | services/predict_service.py:154 |
| Batch | api/routes/upload.py:24 | services/upload_service.py:599 |
FF-Helix Calculation (auxiliary.py)
- Sliding window of core_len=6 residues
- Per-residue helix propensity from _HELIX_PROP dict (Chou-Fasman scale)
- If window mean propensity >= threshold=1.0, residues marked as "in core"
- FF-Helix % = (residues in any qualifying window) / total_length * 100
- Pure function: deterministic, no external dependency
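The calculation above can be sketched in plain Python. This is a minimal illustration, not the code in auxiliary.py. The propensity table holds commonly cited Chou-Fasman helix values and may differ from the project's _HELIX_PROP. The function names are hypothetical, and two behaviors are assumptions: unknown residues get propensity 0.0, and sequences shorter than `core_len` score 0%.

```python
from typing import Dict, List

# Illustrative Chou-Fasman helix propensities (assumption: the project's
# _HELIX_PROP may use different values).
HELIX_PROP: Dict[str, float] = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_core_mask(seq: str, core_len: int = 6, threshold: float = 1.0) -> List[bool]:
    """Mark every residue covered by at least one qualifying window."""
    mask = [False] * len(seq)
    for start in range(len(seq) - core_len + 1):
        window = seq[start:start + core_len]
        mean = sum(HELIX_PROP.get(aa, 0.0) for aa in window) / core_len
        if mean >= threshold:
            for i in range(start, start + core_len):
                mask[i] = True
    return mask


def ff_helix_percent_sketch(seq: str, core_len: int = 6, threshold: float = 1.0) -> float:
    """Percentage of residues that fall inside any qualifying window."""
    if not seq:
        return 0.0
    mask = helix_core_mask(seq, core_len, threshold)
    return 100.0 * sum(mask) / len(seq)
```

Because overlapping windows mark residues in a shared mask, a residue is counted once no matter how many qualifying windows cover it.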
FF Flags (dataframe_utils.py:apply_ff_flags)
- FF-SSW flag: Based on SSW prediction + hydrophobicity threshold
- FF-Helix flag: Based on helix μH comparison with cohort average
- Thresholds configurable via resolved_thresholds dict
- Returns actual thresholds used in meta.thresholds
Normalization (normalize.py)
DataFrame row → row.to_dict() → PeptideSchema.parse_obj() → .to_camel_dict()
→ create_provider_status_for_row()
→ _convert_fake_defaults_to_null() # Nullify fields if provider OFF
→ _sanitize_for_json() # NaN/inf → None
→ PeptideRow.model_validate() # Final schema check
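The NaN/inf → None step matters because strict JSON has no representation for those values. A minimal sketch of that kind of sanitizer follows. It assumes the row is already a plain dict, and `sanitize_for_json_sketch` is a hypothetical stand-in, not the body of `_sanitize_for_json()`.

```python
import json
import math
from typing import Any


def sanitize_for_json_sketch(value: Any) -> Any:
    """Recursively replace NaN and +/-inf floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize_for_json_sketch(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_for_json_sketch(item) for item in value]
    return value


row = {"id": "P1", "sswScore": float("nan"), "muH": 0.42, "charge": float("inf")}
clean = sanitize_for_json_sketch(row)
# allow_nan=False makes json.dumps raise if any NaN/inf slipped through.
payload = json.dumps(clean, allow_nan=False)
```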
Key Column Mappings (CSV → API)
| DataFrame Column | API Key | Type |
|---|---|---|
| Entry | id | str |
| Sequence | sequence | str |
| SSW prediction | sswPrediction | -1/0/1/null |
| SSW score | sswScore | float/null |
| FF-Helix % | ffHelixPercent | float/null |
| Full length uH | muH | float/null |
| Charge | charge | float/null |
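The table can be expressed as a lookup used when renaming keys. This sketch covers only the columns listed above and uses a hypothetical helper name. In the real pipeline the mapping goes through PeptideSchema, which this does not reproduce.

```python
from typing import Any, Dict

# DataFrame column -> API key, copied from the table above.
COLUMN_TO_API_KEY: Dict[str, str] = {
    "Entry": "id",
    "Sequence": "sequence",
    "SSW prediction": "sswPrediction",
    "SSW score": "sswScore",
    "FF-Helix %": "ffHelixPercent",
    "Full length uH": "muH",
    "Charge": "charge",
}


def rename_row_keys(row: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known columns to API keys; drop columns not in the mapping."""
    return {
        COLUMN_TO_API_KEY[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_API_KEY
    }
```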
Single vs Batch MUST Match
These shared functions guarantee identical results:
- auxiliary.ff_helix_percent(): pure, deterministic
- auxiliary.get_corrected_sequence(): AA sanitization
- tango.process_tango_output(): stateless parser
- calc_biochem(): pure calculation
- apply_ff_flags(): same thresholds → same flags
- normalize_rows_for_ui(): same PeptideSchema mapping
Invariant: Same sequence + same config → identical output in single and batch.
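One way to check the invariant is to run each sequence through both flows and diff the normalized rows. The helper below is a hypothetical sketch. `run_single` and `run_batch` are placeholders for whatever callables wrap the two services, and the comparison assumes rows are already normalized, so NaN has become None and plain equality is safe.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def find_mismatches(
    sequences: List[str],
    run_single: Callable[[str], Row],
    run_batch: Callable[[List[str]], List[Row]],
) -> List[Tuple[str, Dict[str, Tuple[Any, Any]]]]:
    """Return (sequence, {key: (single_value, batch_value)}) for differing rows."""
    batch_rows = run_batch(sequences)
    mismatches = []
    for seq, batch_row in zip(sequences, batch_rows):
        single_row = run_single(seq)
        diff = {
            key: (single_row.get(key), batch_row.get(key))
            for key in set(single_row) | set(batch_row)
            if single_row.get(key) != batch_row.get(key)
        }
        if diff:
            mismatches.append((seq, diff))
    return mismatches
```

An empty result means the two flows agree on every key for every sequence. A non-empty one names the exact fields to investigate.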
Debugging Data Issues
- Check provider status: Is TANGO/S4PRED actually running or OFF?
- Check _convert_fake_defaults_to_null(): Does it nullify fields you expect to have data?
- Check threshold mode: Are FF flags computed with expected thresholds?
- Check SSW diff: None is valid (no helix-beta overlap), not "missing data"
- Check single-item threshold: Uses fallback 0.0 when batch size <= 1
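The single-item fallback is worth keeping in mind whenever a threshold is derived from a cohort average. The sketch below assumes the threshold is a plain mean over the batch's values. The real resolve_thresholds() logic is not shown in this document, so `cohort_threshold_sketch` is hypothetical.

```python
from statistics import fmean
from typing import Sequence


def cohort_threshold_sketch(values: Sequence[float], fallback: float = 0.0) -> float:
    """Mean of the cohort, or the fallback when there is no real cohort."""
    if len(values) <= 1:
        # A single peptide has no cohort to average over.
        return fallback
    return fmean(values)
```

Under this assumption, a peptide submitted alone is compared against 0.0, while the same peptide inside a batch is compared against the batch mean. That is the first thing to check when a flag differs between the two flows.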