fts-development-branch-version-upgrade

name: fts-development-branch-version-upgrade
description: Automates Fine-Tuning Scheduler development branch version upgrade process including basic metadata oriented code changes, documentation updates, environment setup, testing, and coverage collection. Use when starting development on a new FTS minor release (e.g., 2.10.0 -> 2.11.0).
license: Apache-2.0
metadata:
  author: speediedan
  version: '1.0'
compatibility: Requires bash, git, uv, Python 3.10+, and access to local FTS repository at ~/repos/finetuning-scheduler

FTS Development Branch Version Upgrade Skill

This skill automates upgrading the Fine-Tuning Scheduler (FTS) development branch to a new version, including the necessary metadata-oriented code changes, documentation updates, environment setup, and validation steps. For now the code changes are limited to version metadata; more ambitious code fixes may be added in the future.

When to Use This Skill

Use this skill when:

Starting development on a new FTS minor or major release
Bumping the minimum/maximum supported PyTorch versions
Upgrading CUDA versions for Docker builds
Synchronizing version metadata across all FTS files

Required User Inputs

Before running this skill, gather the following information:

Current FTS version (e.g., 2.10.0) - typically found in src/finetuning_scheduler/__about__.py with .dev0 suffix
New FTS version (e.g., 2.11.0) - the target version to upgrade to
New PyTorch minimum version (e.g., 2.7.0) - oldest PyTorch version to support (keep a support window of 4-5 PyTorch minor versions)
New PyTorch maximum version (e.g., 2.11.0) - latest PyTorch version to test
New PyTorch nightly version (e.g., dev20260121) - target nightly build for development
New CUDA version (e.g., 13.0.2) - CUDA toolkit version for Docker images
[Optional] Updated TORCH_CUDA_ARCH_LIST (e.g., "7.5;8.0;8.6;9.0;10.0;12.0+PTX") - mirror upstream PyTorch
[Optional] New Lightning min/max versions - if Lightning compatibility changes
[Optional] Report output location - defaults to ~/repos/distributed-insight/project_admin/finetuning-scheduler/handoff_docs/ or /tmp/

Prerequisites

FTS repository checked out locally at ~/repos/finetuning-scheduler on the main branch
Clean working tree (commit or stash uncommitted changes first)
UV package manager installed
Python 3.10+ available
Sufficient disk space for coverage collection (~2GB)
Active environment variables: FTS_VENV_BASE, FTS_TARGET_VENV, FTS_REPO_DIR

Step-by-Step Process

Phase 1: Gather Context and Validate Inputs

Verify working directory:

cd ~/repos/finetuning-scheduler
git status  # Ensure on main branch with clean working tree

Prompt the user for any inputs not already provided:
- Current version (read from src/finetuning_scheduler/__about__.py)
- New version
- New PyTorch min/max versions
- New nightly version
- New CUDA version
- Optional: CUDA arch list, Lightning versions
Validate inputs:
- Version format: semantic versioning (X.Y.Z)
- PyTorch versions: min < max
- CUDA version: major.minor.patch format
- Nightly format: devYYYYMMDD (combined with the PyTorch max version as X.Y.Z.devYYYYMMDD)
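
The validation rules above can be sketched in Python. This is a minimal illustration, not code from the FTS repository: the names `parse_version` and `validate_upgrade_inputs` are hypothetical, and the nightly check assumes the bare `devYYYYMMDD` suffix used in the example inputs (e.g., dev20260121).

```python
import re

SEMVER_RE = re.compile(r"\d+\.\d+\.\d+")
NIGHTLY_RE = re.compile(r"dev\d{8}")


def parse_version(version: str) -> tuple:
    """Split an X.Y.Z string into a tuple of ints so versions compare numerically."""
    if not SEMVER_RE.fullmatch(version):
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    return tuple(int(part) for part in version.split("."))


def validate_upgrade_inputs(new_version, torch_min, torch_max, cuda_version, nightly):
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    checks = [
        ("FTS version", new_version),
        ("PyTorch min", torch_min),
        ("PyTorch max", torch_max),
        ("CUDA version", cuda_version),
    ]
    for label, value in checks:
        if not SEMVER_RE.fullmatch(value):
            problems.append(f"{label} must be X.Y.Z, got {value!r}")
    # Only compare min/max once every version string is known to be well-formed.
    if not problems and parse_version(torch_min) >= parse_version(torch_max):
        problems.append("PyTorch min must be lower than PyTorch max")
    if not NIGHTLY_RE.fullmatch(nightly):
        problems.append(f"nightly must look like devYYYYMMDD, got {nightly!r}")
    return problems
```

Comparing parsed tuples rather than raw strings matters here, because a plain string comparison would rank "2.9.0" above "2.10.0".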

Phase 2: Update Version Metadata Files

Update the following files with version changes:

Core Version Files

src/finetuning_scheduler/__about__.py:
```
__version__ = "{new_version}.dev0"
```
CITATION.cff:
```
version: {new_version}
```
CHANGELOG.md:
- Add new version section at top:
```
## [{new_version}] - 2026-XX-XX

### Added

### Fixed

### Changed

### Deprecated
```
- Update previous version release date if not set
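
The `__about__.py` change in this subsection amounts to a one-line substitution. A minimal sketch, assuming the file contains exactly one `__version__ = "..."` assignment as shown above; `bump_about_version` is a hypothetical helper, not an FTS utility.

```python
import re

VERSION_LINE_RE = re.compile(r'^__version__\s*=\s*"[^"]*"', flags=re.MULTILINE)


def bump_about_version(about_text: str, new_version: str) -> str:
    """Return about_text with __version__ set to '<new_version>.dev0'."""
    replacement = f'__version__ = "{new_version}.dev0"'
    # A callable replacement avoids any backslash/group interpretation by re.
    updated, count = VERSION_LINE_RE.subn(lambda _match: replacement, about_text)
    if count != 1:
        raise ValueError(f"expected exactly one __version__ assignment, found {count}")
    return updated
```

Failing loudly when the assignment is missing or duplicated is a deliberate choice in this sketch: a silent no-op would leave the version metadata out of sync.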

PyTorch Version Files

src/finetuning_scheduler/dynamic_versioning/utils.py:

BASE_DEPENDENCIES = [
    "torch>={new_pytorch_min}",
]

pyproject.toml:

[tool.fts.min-versions]
torch = ">={new_pytorch_min}"

requirements/ci/torch-pre.txt:

{new_pytorch_max}.{nightly_version}
cu{cuda_major}{cuda_minor}
nightly
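
The three lines of `torch-pre.txt` can be derived from the skill inputs. The sketch below is illustrative only: `cuda_target` and `render_torch_pre` are hypothetical names, and the CUDA target is assumed to follow the PyTorch wheel-index convention of `cu` plus major plus minor (for example `cu130` for CUDA 13.0.x), which matches the `cu130` torch-backend example used for the build scripts.

```python
def cuda_target(cuda_version: str) -> str:
    """Map a CUDA toolkit version such as '13.0.2' to a wheel target such as 'cu130'."""
    major, minor, *_patch = cuda_version.split(".")
    return f"cu{major}{minor}"


def render_torch_pre(torch_max: str, nightly: str, cuda_version: str) -> str:
    """Build the torch-pre.txt body: nightly version, CUDA target, channel."""
    lines = [f"{torch_max}.{nightly}", cuda_target(cuda_version), "nightly"]
    return "\n".join(lines) + "\n"
```

For the example inputs, `render_torch_pre("2.11.0", "dev20260121", "13.0.2")` yields the lines `2.11.0.dev20260121`, `cu130`, and `nightly`.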

Docker Configuration Files

dockers/base-cuda/Dockerfile:

ARG CUDA_VERSION={new_cuda_version}
ARG PYTORCH_VERSION={new_pytorch_max}
ENV TORCH_CUDA_ARCH_LIST="{new_cuda_arch_list}"

Update the nightly installation line. If a test or stable variant of the line is currently active, comment it out and uncomment the nightly line:

uv pip install --prerelease=allow torch=={new_pytorch_max}.{nightly_version} --index-url https://download.pytorch.org/whl/nightly/cu{cuda_major}{cuda_minor}

dockers/fts-az-base/Dockerfile:
```
ARG PYTORCH_VERSION={new_pytorch_max}
```
dockers/release/Dockerfile:
```
ARG PYTORCH_VERSION={new_pytorch_max}
```

dockers/docker_images_main.sh:

declare -A iv=(["cuda"]="{new_cuda_version}" ["pytorch"]="{new_pytorch_max}" ...)

dockers/docker_images_release.sh:

declare -A iv=(["cuda"]="{new_cuda_version}" ["pytorch"]="{new_pytorch_max}" ...)

GitHub Workflows and CI Files

.github/workflows/release-docker.yml:

pytorch_version: ["{new_pytorch_max}"]
cust_base: ["cu{new_cuda_version}-"]

.azure-pipelines/gpu-tests.yml:

image: "speediedan/finetuning-scheduler:py3.13-pt{new_pytorch_max}-pl2.6-azpl-init"

Documentation Files

docs/source/versioning.rst:

Add new version row to compatibility table:

* - {new_version}.x
  - {new_pytorch_min}
  - {new_pytorch_max}
  - >= {lightning_min}

docs/source/install/dynamic_versioning.rst:
- Update example torch version in comments
- Update CUDA target examples
.github/copilot-instructions.md:
- Update minimum PyTorch version in "Key Technologies"
- Update example installation commands with new versions
README.md:
- Update example versions in installation instructions
- Update torch-pre.txt format examples
- Update the build status table if applicable (it usually is, since the oldest and newest tested PyTorch versions change, potentially along with the CUDA versions)
tests/README.md:
- Update example installation commands
.github/CONTRIBUTING.md:
- Update example installation commands
.github/ISSUE_TEMPLATE/bug_report.md:
- Update example versions in environment section

Build Scripts

scripts/build_fts_env.sh:
- Update default torch-backend (e.g., cu130)
- Update comments with new version examples
scripts/gen_fts_coverage.sh:
- Update torch-backend comments and defaults
scripts/infra_utils.sh:
- Update CUDA target comments in get_torch_index_url()
requirements/utils/lock_ci_requirements.sh:
- Update comments with new CUDA targets in manual installation examples

Phase 3: Regenerate CI Requirements

After updating version files, regenerate locked requirements:

cd ~/repos/finetuning-scheduler
source /mnt/cache/${USER}/.venvs/fts_latest/bin/activate
./requirements/utils/lock_ci_requirements.sh

Expected outputs:

requirements/ci/requirements.txt (highest resolution)
requirements/ci/requirements-oldest.txt (lowest resolution)
requirements/ci/torch-override.txt (torch prerelease override)

Validation:

Check that torch version in requirements.txt matches expectations
Verify torch-override.txt contains correct prerelease version
Ensure no unexpected dependency changes
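
The first validation check can be partly automated by extracting the torch pin from the locked file. This is a sketch under the assumption that the lock file pins packages with `name==version` lines, one per package; `find_torch_pin` is a hypothetical helper.

```python
import re
from typing import Optional

# Anchoring on 'torch==' at line start avoids matching torchvision, torchaudio, etc.
TORCH_PIN_RE = re.compile(r"^torch==([^\s;#\\]+)", flags=re.MULTILINE)


def find_torch_pin(requirements_text: str) -> Optional[str]:
    """Return the version pinned for torch, or None if no 'torch==' line exists."""
    match = TORCH_PIN_RE.search(requirements_text)
    return match.group(1) if match else None
```

The returned string can then be compared against the expected version for requirements.txt, or against the expected prerelease string for torch-override.txt.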

Phase 4: Rebuild Development Environment

Rebuild the development environment with new dependencies:

~/repos/finetuning-scheduler/scripts/manage_standalone_processes.sh --use-nohup \
  ~/repos/finetuning-scheduler/scripts/build_fts_env.sh \
  --repo-home=${HOME}/repos/finetuning-scheduler \
  --target-env-name=fts_latest \
  --venv-dir=/mnt/cache/${USER}/.venvs

Monitor progress:

tail -f $(ls -rt /tmp/build_fts_env_* | tail -1)
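
The `ls -rt ... | tail -1` idiom selects the most recently modified log. When monitoring from Python instead of a shell, the same selection can be sketched as follows; `latest_log` is a hypothetical helper and the glob pattern is whatever prefix the build script writes to.

```python
import glob
import os


def latest_log(pattern: str):
    """Return the most recently modified path matching pattern, or None if none match."""
    candidates = glob.glob(pattern)
    return max(candidates, key=os.path.getmtime) if candidates else None
```

For example, `latest_log("/tmp/build_fts_env_*")` mirrors the shell command above, and returning None makes the "no log yet" case explicit.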

Expected duration: ~5 minutes (mostly to download new torch version)

Validation:

Build completes without errors
New PyTorch version is installed
UV doesn't report hardlink warnings (if venv on same filesystem as UV cache)

Phase 5: Collect Full Coverage

Run comprehensive test coverage collection (runs in background):

~/repos/finetuning-scheduler/scripts/manage_standalone_processes.sh --use-nohup \
  ~/repos/finetuning-scheduler/scripts/gen_fts_coverage.sh \
  --repo-home=${HOME}/repos/finetuning-scheduler \
  --target-env-name=fts_latest \
  --venv-dir=/mnt/cache/${USER}/.venvs \
  --no-rebuild-base \
  --allow-failures

Monitor progress:

tail -f $(ls -rt /tmp/gen_fts_coverage_fts_* | tail -1)

Expected duration: ~30 minutes

Handle Test Failures

New leaked variables:

If PyTorch or Lightning leaks new environment variables, add them to the allowlist in tests/conftest.py:

@pytest.fixture(scope="function", autouse=True)
def restore_env_variables():
    allowlist = {
        # ... existing entries ...
        "NEW_LEAKED_VAR",  # leaked by torch/lightning {version}
    }
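
To work out which names belong in the allowlist, it can help to diff the environment before and after a test run. The sketch below shows that idea only; it is not the actual FTS fixture logic, and `find_leaked_vars` is a hypothetical helper.

```python
def find_leaked_vars(before: dict, after: dict, allowlist: set) -> set:
    """Return names added or changed between two snapshots that are not allowlisted."""
    changed = {name for name, value in after.items() if before.get(name) != value}
    return changed - allowlist
```

Snapshots can be taken with `dict(os.environ)` before and after the code under test. Any name the function returns is a candidate for the allowlist, along with a comment noting which torch or Lightning version leaked it.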

Other errors:

Document all errors in the upgrade report
Most errors should be addressed manually post-upgrade
Include error messages, affected tests, and potential fixes

Phase 6: Rebuild and Validate Documentation

Clean and rebuild docs:

export FTS_VENV_BASE=/mnt/cache/${USER}/.venvs
export FTS_TARGET_VENV=fts_latest
export FTS_REPO_DIR=${HOME}/repos/finetuning-scheduler

cd ${FTS_REPO_DIR} && source ${FTS_VENV_BASE}/${FTS_TARGET_VENV}/bin/activate
cd docs && make clean
make html --debug SPHINXOPTS="-W --keep-going"

Run linkcheck:

make linkcheck SPHINXOPTS="-W --keep-going"
grep -i "error\|broken" build/linkcheck/output.txt || echo "No errors found"
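
For the upgrade report, a per-status count is often more useful than raw grep output. The sketch below assumes each entry in output.txt has the form `file:line: [status] detail`, which is how Sphinx linkcheck output typically looks; verify this against the actual file before relying on it. `summarize_linkcheck` is a hypothetical helper.

```python
import re

ENTRY_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): \[(?P<status>[^\]]+)\] (?P<detail>.*)$")


def summarize_linkcheck(output_text: str) -> dict:
    """Count linkcheck entries per status (for example 'broken')."""
    counts = {}
    for line in output_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            status = match.group("status")
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Lines that do not match the assumed format are ignored rather than counted, so an empty result should be treated as "nothing recognized", not as proof that there are no broken links.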

Expected results:

No build warnings or errors
No broken links
All cross-references resolve correctly

Document issues:

Any new warnings or errors
Broken links (may need URL updates)
Missing cross-references

Phase 7: Generate Upgrade Report

Create comprehensive report at the determined output location (default: ~/repos/distributed-insight/project_admin/finetuning-scheduler/handoff_docs/fts_version_upgrade_{new_version}_YYYYMMDD.md):

Report Structure

# FTS Version Upgrade Report: {old_version} → {new_version}

**Date**: YYYY-MM-DD
**Agent**: [Your name/identifier]
**Duration**: [Total time taken]

## Executive Summary

- FTS version upgraded from {old_version} to {new_version}
- PyTorch support updated: {old_pytorch_min}-{old_pytorch_max} → {new_pytorch_min}-{new_pytorch_max}
- CUDA version updated: {old_cuda} → {new_cuda}
- Coverage: X% (change from baseline)
- Status: [All tests passing / N failures documented below]

## Changes Made

### Version Metadata Updates
- [List all files changed with brief description]

### Dependency Updates
- PyTorch minimum: {old} → {new}
- PyTorch maximum: {old} → {new}
- PyTorch nightly: {old} → {new}
- CUDA version: {old} → {new}
- [Any other dependency changes]

### Docker Configuration
- Base image CUDA: {old} → {new}
- PyTorch version: {old} → {new}
- CUDA arch list: [changes if any]

## Build and Test Results

### Environment Rebuild
- Status: [Success/Failure]
- Duration: [X minutes]
- Issues: [None / list issues]

### Coverage Collection
- Status: [Success/Partial/Failure]
- Duration: [X minutes]
- Total coverage: X%
- Coverage change: [+/- X%]

### Test Failures
[If any tests failed]

#### Leaked Variables Fixed
- `VARIABLE_NAME`: leaked by torch/lightning {version} - [added to allowlist]

#### Outstanding Issues
1. **Test**: test_name
   **Error**: [error message]
   **Proposed fix**: [suggestion]

### Documentation Build
- Status: [Success/Failure]
- Warnings: [count]
- Broken links: [count]
- Issues: [None / list issues]

## Recommendations

### Immediate Actions Required
- [List any critical issues that need manual intervention]

### Follow-up Tasks
- [List non-critical improvements or cleanups]

### Skill Improvements
[Suggestions for improving this skill based on issues encountered]

## Validation Checklist

- [ ] All version files updated
- [ ] CI requirements regenerated
- [ ] Development environment rebuilt successfully
- [ ] Coverage collected (with acceptable failure rate)
- [ ] Documentation builds without errors
- [ ] No new broken links
- [ ] Leaked variables handled
- [ ] Report generated

## Files Modified

[Complete list of files changed during upgrade]

## Next Steps

1. Review and address outstanding test failures
2. Commit changes with message: "Bump version to {new_version}, update PyTorch to {new_pytorch_max}"
3. Create PR for version upgrade
4. Monitor CI runs for any platform-specific issues

## Appendix

### Full Coverage Output
[Attach or reference full coverage report]

### Build Logs
[Reference to build log locations]

### Test Error Details
[Full stack traces for failed tests]
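
The report filename pattern given at the start of this phase can be built programmatically. In this sketch, falling back to /tmp when the default directory does not exist is an assumption about how the "or /tmp/" default is meant to work, and `report_path` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path

DEFAULT_REPORT_DIR = Path(
    "~/repos/distributed-insight/project_admin/finetuning-scheduler/handoff_docs"
)


def report_path(new_version: str, today: date, report_dir: Path = DEFAULT_REPORT_DIR) -> Path:
    """Build the report path, using /tmp when report_dir is not an existing directory."""
    target_dir = report_dir.expanduser()
    if not target_dir.is_dir():
        target_dir = Path("/tmp")  # assumed fallback location
    return target_dir / f"fts_version_upgrade_{new_version}_{today:%Y%m%d}.md"
```

Passing the date explicitly, rather than calling `date.today()` inside the function, keeps the result reproducible when the report is regenerated later.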

Phase 8: Propose Skill Improvements

If any unexpected issues were encountered, suggest updates to this skill:

Common improvement areas:

New files that need version updates
Changed file paths or structures
New validation steps needed
Improved error handling
Better progress monitoring
Additional automation opportunities

Format for suggestions:

## Suggested Skill Improvements

### Addition: [New step/file to handle]
**Reason**: [Why this is needed]
**Implementation**: [How to add it]

### Clarification: [Existing step that was unclear]
**Issue**: [What was confusing]
**Proposed update**: [Clearer wording]

### Automation: [Manual step that could be automated]
**Current process**: [What's done manually]
**Proposed automation**: [How to automate]

Error Handling

Common Issues and Solutions

Issue: UV hardlink warnings during environment build
Solution: Ensure the venv is on the same filesystem as the UV cache (use the --venv-dir flag)

Issue: Torch prerelease not found
Solution: Verify that the nightly version exists at the PyTorch download site; a different nightly date may be needed

Issue: Coverage collection hangs
Solution: Check for conflicting pytest processes and kill any that are more than 40 minutes old

Issue: Documentation build fails with missing references
Solution: Check for typos in cross-references and ensure all referenced sections exist

Issue: Locked requirements have unexpected versions
Solution: Check the pyproject.toml constraints; dependency pins may need updating

Validation Steps

After completing all phases, verify:

Version consistency: All files reference new version correctly
Build success: Environment builds without errors
Test status: Coverage collected (failures documented)
Documentation: Builds cleanly with no warnings
Git status: All changes tracked, ready to commit
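
The version consistency check can be approximated by scanning file contents for lingering references to the old version. This is a minimal sketch, not an FTS tool: `find_stale_versions` is a hypothetical helper that takes a mapping of file names to their text.

```python
import re


def find_stale_versions(files: dict, old_version: str) -> dict:
    """Map each file name to the 1-based line numbers that still mention old_version."""
    # The lookarounds keep '2.10.0' from matching inside a longer number like '12.10.01'.
    pattern = re.compile(rf"(?<![\d.]){re.escape(old_version)}(?!\d)")
    stale = {}
    for name, text in files.items():
        hits = [
            number
            for number, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)
        ]
        if hits:
            stale[name] = hits
    return stale
```

Some hits are legitimate, such as the previous release section in CHANGELOG.md or older rows in the compatibility table, so the output is a review list rather than a list of errors.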

Notes

Always run on a clean working tree (commit or stash first)
Use --allow-failures for initial coverage run to capture all issues
Monitor log files to catch issues early
Keep the old environment as a backup until the new one is validated
Document all deviations from expected behavior