name: ossd-bulk-update description: Use when importing or updating many projects at once from external data sources (APIs, JSON, CSV), or building collections from external rankings and lists
Bulk Update Projects
You help maintainers perform bulk updates to the oss-directory using data from external sources like APIs, JSON files, or spreadsheets. This skill handles systematic updates to many projects at once, ensuring data consistency and proper validation.
When to Use This Skill
Use this skill when:
- Importing/updating projects from external data sources (APIs, JSON, CSV)
- Adding missing metadata to multiple existing projects
- Creating collections based on external rankings or lists (e.g., "Top 50 by TVL")
- Synchronizing oss-directory with external data providers
- Bulk enrichment of project data (adding DefiLlama, npm, blockchain addresses, etc.)
Do not use this skill when:
- Adding a single project (use
add-projectskill) - Updating a single project (use
update-projectskill) - Manual one-off updates without external data
Core Workflow
0. Working Directory
All commands assume you're in the oss-directory root. If not already there: cd ~/GitHub/oss-directory
1. Obtain and Prepare External Data
Download or fetch the data:
# Example: DefiLlama protocols API
curl -o /tmp/protocols_data.json https://api.llama.fi/protocols
# Example: Custom API with authentication
curl -H "Authorization: Bearer $TOKEN" \
-o /tmp/data.json \
https://api.example.com/projects
# Example: Using existing file
cp /path/to/data.csv /tmp/input_data.csv
Understand the data structure:
- Read the file to understand field mappings
- Identify unique identifiers (slugs, GitHub URLs, etc.)
- Note which fields map to oss-directory schema fields
2. Plan the Bulk Update
Create a task list to track progress using TaskCreate for each step:
- Download and analyze external data source
- Map external data to oss-directory projects
- Create new projects for missing entries
- Update existing projects with new data
- Validate all changes
- Commit changes
Update tasks as you progress — set in_progress when starting, completed when done.
3. Create Systematic Mapping
Map external data identifiers to oss-directory project names:
Strategy A: Use GitHub URLs
# Extract GitHub org from external data
cat /tmp/data.json | jq -r '.[] | select(.github != null) | .github' | \
sed 's|https://github.com/||' | sed 's|/.*||' > /tmp/github_orgs.txt
# Check which exist in oss-directory
while read org; do
if find data/projects -name "$org.yaml" -type f 2>/dev/null | grep -q .; then
echo "EXISTS: $org"
else
echo "MISSING: $org"
fi
done < /tmp/github_orgs.txt
Strategy B: Use name/slug mapping
# Create mapping file
cat /tmp/data.json | jq -r '.[] | "\(.slug),\(.github)"' > /tmp/mapping.csv
# Search for existing projects by slug
cat /tmp/mapping.csv | while IFS=, read slug github; do
# Try exact match
if find data/projects -name "$slug.yaml" | grep -q .; then
echo "MATCH: $slug"
else
# Try fuzzy match
find data/projects -name "*$slug*.yaml"
fi
done
Strategy C: Manual mapping for complex cases
# Log all mappings to a file for review
echo "external_id,oss_directory_name,action,notes" > /tmp/bulk_mapping.csv
# Fill this in as you process each entry
4. Process Each Entry Systematically
For each item in the external data:
Step A: Determine if project exists
# Check by GitHub org
find data/projects -name "[org-name].yaml"
# Check by slug/name
find data/projects -name "[slug].yaml"
# Grep for GitHub URL
grep -r "github.com/[org-name]" data/projects/
Step B: If project exists → Update it
# Read current project file
cat data/projects/[letter]/[project-name].yaml
# Use Edit tool to add/update fields:
# - defillama URLs
# - npm packages
# - blockchain addresses
# - social media
# - websites
Step C: If project doesn't exist → Create it
# Determine project name from GitHub org (preferred)
# File path: data/projects/[first-letter]/[project-name].yaml
# Create minimal valid project:
version: 7
name: [project-name]
display_name: [Display Name]
description: [1-2 sentence description]
websites:
- url: [website-url]
github:
- url: https://github.com/[org-name]
[additional-fields-from-data]
CRITICAL RULES:
- ✅ GitHub org URLs only - Use
https://github.com/org-nameNOThttps://github.com/org-name/repo-name - ✅ Match project name to GitHub org - If org is
babylonlabs-io, name must bebabylonlabs-io - ✅ Validate external URLs - Test that URLs work before adding (especially CEX vs protocol URLs)
- ✅ Comment out invalid URLs - Don't remove data, comment it out for future reference
- ✅ Log all mappings - Keep track of what you're doing in
/tmp/mapping_log.txt
5. Manual Research Strategy (When No API Available)
When building a collection without an API (e.g., "Top 50 Banks"):
Step A: Search by Category
# Instead of searching for individual companies, search by industry category:
# - "fintech neobanks GitHub organization"
# - "insurance companies GitHub organization"
# - "credit bureaus GitHub organization"
# - "investment firms GitHub organization"
Use web searches to:
- Identify companies in each category
- Find their GitHub organizations
- Verify they have meaningful open source presence (public repos, not just private orgs)
Step B: Verify GitHub Presence
# Not all major companies have usable GitHub orgs. Check:
# ✓ Does the GitHub org have public repositories?
# ✓ Are there 3+ public repos with meaningful content?
# ✓ Is it the official org (not third-party or fan accounts)?
# Examples of companies WITHOUT useful public GitHub orgs:
# - Wells Fargo (private org, no public repos)
# - E*TRADE (no official org found)
# - DBS Bank (no official org found)
# - OCBC Bank (no official org found)
Skip companies that don't have public GitHub presence - they don't belong in oss-directory.
Step C: Handle Multiple Orgs per Institution
Many large institutions have multiple related GitHub organizations:
# Example: Experian
github:
- url: https://github.com/experianplc # Main org (23 repos)
- url: https://github.com/experiandataquality # Division/product org
# Example: MetLife
github:
- url: https://github.com/MetLife # Corporate (6 repos)
- url: https://github.com/MetLifeLegalPlans # Subsidiary (25 repos)
# Example: TransUnion
github:
- url: https://github.com/transunion # Main org
- url: https://github.com/Trustev # Acquisition
- url: https://github.com/iovation # Global Fraud Solutions
When to include multiple orgs:
- Different divisions/products of same company
- Acquisitions that maintained separate GitHub presence
- Regional/business unit organizations
When NOT to include:
- Personal repos of employees
- Third-party integrations
- Unrelated projects with similar names
Step D: GitHub Org Naming Variations
GitHub org names don't always match company names exactly:
| Company | GitHub Org | Reason |
|---|---|---|
| Wise | transferwise |
Former company name |
| Chime | 1debit |
Technical/corporate entity |
| Vanguard | Vanguard-oss |
Dedicated OSS org |
| Xero | xeroapi |
Developer/API focused org |
| State Farm | StateFarmIns |
Abbreviated form |
How to find the correct org:
- Check company's developer portal or open source page
- Search: "[Company Name] GitHub organization"
- Look for mentions in blog posts or job postings
- Check repos for official branding/copyright
Step E: Description Standards for Manual Entries
When creating descriptions for manually researched projects:
# Include these elements:
description: [Company Type] is a/an [Nationality] [Industry] company providing [Services]. [Company] has [N] repositories on GitHub [notable projects if known]
# Examples:
# With repo count and notable projects:
description: State Farm Mutual Automobile Insurance Company is an American insurance and financial services company providing insurance and financial services. State Farm has 175 repositories on GitHub including CLAWS, TheThingStore, and Terraform modules
# With repo count only:
description: Wise (formerly TransferWise) is a British financial technology company providing international money transfer services and multi-currency accounts. Wise has 224 repositories on GitHub with open source tools and libraries
# With general description (when repo count unavailable):
description: Nubank is a Brazilian neobank and financial technology company providing banking services, credit cards, and personal loans through mobile applications. Nubank has extensive GitHub presence with open source projects including fklearn and other machine learning tools
Where to get repo counts:
- GitHub org page shows public repository count
- Can be found via GitHub API:
https://api.github.com/orgs/[org-name]
Step F: Category-Based Target Completion
When working toward a specific collection size (e.g., "Top 50"):
Strategy:
- Start with obvious/largest companies in core category
- Track progress toward goal (e.g., 32/50)
- Expand to adjacent categories systematically
- Search by category, not individual companies
Example progression for "Top 50 TradFi":
- Start: Top 10 US banks → 10 projects
- Round 1: More US banks + payment processors → 17 projects
- Round 2: European banks + fintech → 26 projects
- Round 3: Neobanks + financial software → 32 projects
- Round 4: Insurance companies → 38 projects
- Round 5: Credit bureaus + remaining → 50 projects ✓
Categories to consider for financial institutions:
- Traditional banks (retail, commercial, investment)
- Neobanks/digital banks (Monzo, Nubank, Chime)
- Payment processors (Stripe, Adyen, Square)
- Fintech platforms (Plaid, Wise, Robinhood)
- Investment firms (Vanguard, BlackRock, Fidelity)
- Insurance companies (AIG, MetLife, State Farm, Liberty Mutual)
- Credit bureaus (Equifax, Experian, TransUnion)
- Financial software (Xero, Intuit, Bloomberg)
6. Handle Special Cases
DefiLlama URLs (CEX vs Protocol)
DefiLlama has two URL patterns that cannot be mixed:
- Protocol URLs:
https://defillama.com/protocol/[slug]- Valid in schema - CEX URLs:
https://defillama.com/cex/[slug]- NOT valid in schema (comment out)
# CEXs (centralized exchanges) - comment out the /cex/ URL
defillama:
# - url: https://defillama.com/cex/binance # Not valid in schema
- url: https://defillama.com/protocol/binance-staked-eth # Valid protocol URL
# Protocols - use /protocol/ URLs
defillama:
- url: https://defillama.com/protocol/uniswap-v3
- url: https://defillama.com/protocol/uniswap-v2
How to check: Look for category: "CEX" field in the data source.
Blockchain Addresses
Validate tags before adding:
# Check on Etherscan
curl "https://etherscan.io/address/0x..."
# If it's a contract, use tags: [contract]
# If it's an EOA, use tags: [eoa]
# If it's a deployer, use tags: [deployer]
blockchain:
- address: "0x1234567890123456789012345678901234567890"
networks:
- mainnet # Use specific network, NOT "any_evm"
tags:
- contract # or: eoa, deployer, safe, wallet
Multiple Protocols per Project
Some projects have multiple protocols under one GitHub org:
# Example: Hyperliquid has multiple protocols
defillama:
- url: https://defillama.com/protocol/hyperliquid-bridge
- url: https://defillama.com/protocol/hyperliquid-hlp
- url: https://defillama.com/protocol/hyperliquid-spot-orderbook
- url: https://defillama.com/protocol/hyperliquid-perps
Check the data source for all protocols associated with the same GitHub org.
7. Validation Strategy
Incremental validation - Don't wait until the end:
# After every 10-20 updates, validate
pnpm run validate
# If errors, fix immediately
# Common issues:
# - Empty defillama arrays (remove entire section or comment it out)
# - Duplicate YAML keys (merge them)
# - Invalid URLs (fix or comment out)
# - Wrong indentation (run pnpm prettier:write)
Batch Processing Strategy: When creating many files manually:
- Create 5-10 project files
- Run
pnpm run validateto catch errors early - Fix any issues immediately
- Continue with next batch
- Commit every 10-20 successful files to avoid losing work
Collection Alphabetical Sorting: Collections MUST have projects in alphabetical order:
# ✓ CORRECT - Alphabetically sorted
projects:
- aave
- babylonlabs-io
- binance
- compound
# ✗ WRONG - Not alphabetically sorted
projects:
- compound
- aave
- binance
- babylonlabs-io
The validation will pass even if unsorted, but it's required by convention. When updating a collection with new projects:
- Add all new project names to the list
- Re-sort the entire list alphabetically
- Verify the count matches your goal (e.g., 50 projects)
Pro tip: Use shell to verify alphabetical order:
# Check if collection is alphabetically sorted
cat data/collections/your-collection.yaml | \
grep "^ - " | \
sort -c 2>&1 && echo "✓ Sorted" || echo "✗ Not sorted"
Final validation:
pnpm run validate
pnpm lint
pnpm build
pnpm test
8. Create/Update Collection (If Applicable)
If the bulk update is for a specific list/ranking:
# Example: data/collections/defillama-top-50-protocols.yaml
version: 7
name: defillama-top-50-protocols
display_name: DefiLlama Top 50 Protocols
description: Top 50 DeFi protocols by TVL from DefiLlama API
projects:
- aave
- babylonlabs-io
- binance
# ... (alphabetically sorted)
9. Commit Strategy
Option A: Single commit (for smaller bulk updates)
git add data/projects/ data/collections/
git commit -m "Bulk update: Add DefiLlama URLs to 50 DeFi projects
- Created 10 new project files
- Updated 40 existing projects with DefiLlama URLs
- Added defillama-top-50-protocols collection
Data source: DefiLlama API (api.llama.fi/protocols)
"
Option B: Multiple commits (for large/complex updates)
# Commit 1: New projects
git add data/projects/*/[new-projects].yaml
git commit -m "Add 10 new DeFi projects from DefiLlama top 50"
# Commit 2: Update existing
git add data/projects/*/[updated-projects].yaml
git commit -m "Add DefiLlama URLs to 40 existing projects"
# Commit 3: Collection
git add data/collections/defillama-top-50-protocols.yaml
git commit -m "Add DefiLlama top 50 protocols collection"
Include in commit message:
- What was updated
- How many projects affected (created vs updated)
- Data source and date
- Any caveats or known issues
Common Data Sources
DefiLlama API
# All protocols
curl https://api.llama.fi/protocols > /tmp/defillama_protocols.json
# Mappings to explore:
# - slug → project name
# - github → GitHub org
# - url → DefiLlama protocol URL
# - category → "CEX" or protocol type
# - tvl → Total Value Locked (for ranking)
GitHub API (for org/repo data)
# Get org info
curl https://api.github.com/orgs/[org-name]
# List org repos
curl https://api.github.com/orgs/[org-name]/repos
CSV/Spreadsheet Data
# Convert CSV to JSON for easier processing
cat data.csv | \
jq -R 'split(",")' | \
jq -s 'map({name: .[0], github: .[1], website: .[2]})'
Error Handling Patterns
Duplicate Projects Found
# If you find an existing project during bulk import
# Check if it's truly a duplicate or should be merged
# Option 1: Update existing project with new data
# Option 2: Skip if data already exists
# Option 3: Merge if different data sources
# Log the decision:
echo "Skipped [external-id]: Already exists as [oss-project-name]" >> /tmp/skipped.log
Invalid or Missing GitHub URLs
# If GitHub URL is missing from external data but you find it:
# Add it manually and document:
# Missing in API, found manually: https://github.com/[org]
github:
- url: https://github.com/[org]
Validation Failures
# Common fixes:
# 1. Empty arrays (defillama with only commented URLs)
# Remove the entire section or ensure at least one valid URL
# 2. Duplicate keys in YAML
# Merge the duplicate sections
# 3. Invalid URLs
# Test each URL and comment out ones that 404
# 4. Wrong schema fields
# Check src/resources/schema/project.json for valid fields
Best Practices
1. Work Incrementally
- Process data in batches of 10-20 projects
- Validate after each batch
- Commit working batches before moving to next
2. Keep Audit Trails
# Log everything you do
echo "$(date): Started processing DefiLlama top 50" >> /tmp/bulk_update.log
echo "Created: babylonlabs-io.yaml (rank #2, $71B TVL)" >> /tmp/bulk_update.log
echo "Updated: aave.yaml - added DeFiLlama URLs" >> /tmp/bulk_update.log
3. Document Edge Cases
# In commit messages or as YAML comments:
# Comment in YAML:
# Note: CEX URL not valid in schema, commented out as of 2025-01
defillama:
# - url: https://defillama.com/cex/binance
# In commit message:
# Note: 5 projects missing GitHub orgs were not added to collection
4. Verify External Data Quality
# Before trusting external data, spot-check it:
# Check if GitHub URLs are valid
cat /tmp/data.json | jq -r '.[].github' | head -5 | while read url; do
curl -I "$url" 2>&1 | grep -q "200 OK" && echo "✓ $url" || echo "✗ $url"
done
# Check if websites are live
cat /tmp/data.json | jq -r '.[].website' | head -5 | while read url; do
curl -I "$url" 2>&1 | grep -q "200 OK" && echo "✓ $url" || echo "✗ $url"
done
5. Update Tasks Throughout
Update your task list after completing each major step:
- Mark tasks completed immediately using
TaskUpdate - Add new tasks with
TaskCreateif you discover issues - Keep only one task
in_progressat a time
Examples: Complete Bulk Update Workflows
Example 1: Import Top 50 DeFi Protocols from DefiLlama (API-Based)
1. Fetch data
curl https://api.llama.fi/protocols > /tmp/defillama_protocols.json
2. Create task list (using TaskCreate for each step)
- Download DefiLlama API data
- Map top 50 protocols to oss-directory
- Create missing projects
- Update existing projects
- Create collection
- Validate all changes
- Commit
3. Analyze and map
# Get top 50 by TVL
cat /tmp/defillama_protocols.json | \
jq 'sort_by(-.tvl) | .[0:50] | .[] | {slug, name, github, tvl}' > /tmp/top50.json
# Check which exist
cat /tmp/top50.json | jq -r '.github' | while read gh; do
org=$(echo $gh | sed 's|https://github.com/||' | cut -d'/' -f1)
find data/projects -name "$org.yaml" | grep -q . && echo "EXISTS: $org" || echo "NEW: $org"
done
4. Process systematically
# For each of the 50:
# - If exists: Use Edit to add defillama URLs
# - If missing: Use Write to create new project file
# - Log each action
5. Create collection
# data/collections/defillama-top-50-protocols.yaml
version: 7
name: defillama-top-50-protocols
display_name: DefiLlama Top 50 Protocols
description: Top 50 DeFi protocols by TVL
projects:
- [all-50-project-names-alphabetically]
6. Validate incrementally
# After every 10 projects
pnpm run validate
7. Commit
git add data/projects/ data/collections/defillama-top-50-protocols.yaml
git commit -m "Add DefiLlama Top 50 Protocols
- Created 10 new projects
- Updated 40 existing projects with DefiLlama URLs
- Collection tracks top 50 DeFi protocols by TVL
- Data source: api.llama.fi/protocols (2025-01-19)
"
Example 2: Build Top 50 TradFi Collection (Manual Research)
Scenario: Create collection of 50 traditional finance institutions with GitHub presence
1. Create task list (using TaskCreate for each step)
- Search for fintech companies (Wise, SoFi, Monzo, Chime)
- Search for insurance companies
- Search for credit bureaus
- Search for investment firms
- Create all project files
- Update collection to 50
- Validate and commit
2. Research by category
# Search: "Wise TransferWise GitHub organization"
# Found: github.com/transferwise (224 repos)
# Search: "Monzo bank GitHub organization"
# Found: github.com/monzo (174 repos)
# Search: "Chime bank GitHub organization"
# Found: github.com/1debit (47 repos)
# Search: "State Farm GitHub organization"
# Found: github.com/StateFarmIns (175 repos)
# For each: Verify public repos exist and are meaningful
3. Track progress toward goal
# After fintech batch: 32 → 36 (need 14 more)
# After insurance batch: 36 → 42 (need 8 more)
# After credit bureaus: 42 → 45 (need 5 more)
# After investment firms: 45 → 50 ✓
4. Create files in batches
# Batch 1: Create 5 fintech projects
# Validate: pnpm run validate
# Batch 2: Create 6 insurance projects
# Validate: pnpm run validate
# Continue until 50 projects created
5. Create project files with standard format
# Example: Wise
version: 7
name: wise
display_name: Wise
description: Wise (formerly TransferWise) is a British financial technology company providing international money transfer services and multi-currency accounts. Wise has 224 repositories on GitHub with open source tools and libraries
websites:
- url: https://wise.com
github:
- url: https://github.com/transferwise
social:
twitter:
- url: https://twitter.com/Wise
6. Update collection (alphabetically sorted)
# data/collections/tradfi-top-50-banks.yaml
version: 7
name: tradfi-top-50-banks
display_name: Traditional Finance Institutions
description: Traditional financial services institutions including banks, investment banks, asset managers, payment processors, and fintech companies with active open source projects on GitHub
projects:
- adyen
- affirm
- aig
- ally
# ... (all 50 in alphabetical order)
- wise
- xero
7. Commit
git add data/projects/ data/collections/tradfi-top-50-banks.yaml
git commit -m "Complete TradFi Top 50 collection with 18 new institutions
Added 18 new traditional finance institutions to reach 50 total:
Fintech & Neobanks:
- Wise (224 repos), Monzo (174 repos), Nubank, Chime (47 repos)
Investment & Trading:
- Vanguard, Interactive Brokers, Bloomberg
Insurance:
- State Farm (175 repos), Liberty Mutual (53 repos), AIG, MetLife, Prudential, Nationwide
Credit Bureaus:
- Equifax, Experian (23 repos), TransUnion
Collection includes banks, investment firms, insurance, credit bureaus,
payment processors, and fintech - all with verified GitHub presence.
"
Key differences from API-based workflow:
- No API to fetch - must manually research each company
- Search by category/industry instead of ranking API
- Verify each GitHub org exists and is public before creating
- Handle org naming variations (Wise→transferwise, Chime→1debit)
- Track progress toward numerical goal (50 projects)
- Include repo counts in descriptions when available
Troubleshooting
Problem: Too many validation errors
Solution: Process in smaller batches. Fix errors immediately rather than batching fixes.
Problem: Unclear how to map external IDs to project names
Solution: Use multiple strategies:
- Check GitHub URL first (most reliable)
- Try slug/name exact match
- Try fuzzy search on display names
- Ask user for manual mapping file
Problem: External data has fields not in schema
Solution:
- Check
src/resources/schema/project.jsonfor valid fields - Add only supported fields
- Log unsupported fields for potential future schema additions
Problem: GitHub orgs vs repos confusion
Solution:
- Always prefer org-level URLs:
https://github.com/org-name - Only use repo URLs if project has no org
- Never mix org and repo URLs unless they're different orgs
Problem: Commit is too large
Solution:
- Break into multiple commits by category:
- New projects
- Updated existing projects
- Collections
- Fixes/corrections
Resources
- oss-directory docs: https://docs.oso.xyz/docs/projects
- Project schema:
src/resources/schema/project.json - Collection schema:
src/resources/schema/collection.json - Example bulk updates: Search git log for "bulk", "import", "add.*protocols"
Related Skills
add-project- For adding individual projectsupdate-project- For updating individual projectsadd-collection- For creating collectionsreview-pr- For reviewing community contributions