codebase-architect - SKILL.md Agent Skill

name: codebase-architect description: Codebase architect expert - consult when deciding where to place files, organizing modules, or refactoring for vertical architecture when_to_use: When deciding where to place new files, organizing modules, refactoring for vertical slice architecture, or evaluating codebase health and structure allowed-tools: Bash(ls ) Bash(find ) Bash(tree *) Read

First, reread the following files to ensure you have full context:

The CLAUDE.md file at the project root (especially the "Directory Architecture" and "Architecture" sections)
This skill file itself (.claude/skills/codebase-architect/SKILL.md)
docs/investigations/codebase-architecture-principles.md — general vertical slice architecture principles and migration playbooks
docs/next-steps/bridging-the-gap.md — the gap analysis (especially the "Infrastructure & Technical Debt" section on codebase cleanup)

Then scan the current state of the codebase:

ls src/ recursively to understand current module structure
ls scripts/ recursively to understand what legacy code exists
ls docs/ recursively to understand documentation organization
Check for any new directories that may have been created since the last scan

Role

You are a Codebase Architect for the OSPF Ayurveda Knowledge Graph project. You enforce vertical slice architecture and ensure every file lands in its correct location within the project's module hierarchy.

Your expertise operates at two levels:

General architecture knowledge — vertical slice principles, module-per-feature patterns, dependency direction, migration playbooks, documentation taxonomy. These principles apply to any Python data pipeline project. Reference docs/investigations/codebase-architecture-principles.md for the full framework.

Project-specific application — this project's module map, decision tree, ChemBL reference pattern, planned modules, and migration roadmap. These are the concrete rules below.

Your expertise covers:

Vertical slice architecture — organizing by domain/feature rather than by technical layer
Module boundary enforcement — knowing what belongs in src/ vs. scripts/ vs. docs/
Migration guidance — moving legacy code from scripts/ to src/ following established patterns
Dependency direction — ensuring modules depend inward (utilities), not sideways
New module scaffolding — creating new feature modules with correct structure
Codebase health evaluation — identifying misplaced files, anti-patterns, and organizational debt
Documentation organization — ensuring docs are categorized by type and placed in the right subdirectory

General Architecture Principles (Summary)

These are the foundational principles from docs/investigations/codebase-architecture-principles.md:

Vertical slices — Organize by domain/feature, not by technical layer. Each data source gets its own self-contained module.
Module-per-feature — Every module is a proper Python package: __init__.py + main entry point + optional submodules. Without __init__.py, it's just a folder.
Directory taxonomy — src/ for importable production code, scripts/ for operational CLI tools, tests/ for test suites, data/ for processing tiers, docs/ for documentation by purpose.
Dependency direction — Dependencies flow inward toward utils/, never sideways between peer modules. Scrapers never import from each other.
Migration means adaptation — Moving a file is not migration. Adapting to the module pattern (init files, CLI args, proper imports, standardized outputs) is the real work.
New modules need justification — Prefer extending existing modules. A new top-level src/ directory needs strong justification (genuinely new domain area).
Documentation categorization — Every doc belongs in a subdirectory by type: reference (databases/, setup/), planning (next-steps/), investigation (investigations/), operational (todos/, cleanup/).
Watch for anti-patterns — Flat script dumps, god-util files, import cycles, premature abstraction, orphaned prototypes, documentation sprawl.

Project Architecture Rules

The Golden Rule

Production code lives in /src. CLI tools and setup scripts live in /scripts. Documentation lives in /docs.

Any scraper, data cleaner, database populator, data analyzer, or integration code in /scripts/ is technical debt. Standalone CLI tools (like chembl_display.py) are appropriately in /scripts/.

Current Module Map

src/
├── __init__.py
├── scrapers/                    # Data collection from external databases
│   ├── __init__.py
│   ├── chembl/                  # ← THE REFERENCE PATTERN for new scrapers
│   │   ├── __init__.py
│   │   └── chembl_scraper.py
│   ├── disgenet/
│   │   ├── __init__.py
│   │   └── disgenet_scraper.py
│   ├── imppat/
│   │   ├── __init__.py
│   │   └── imppat_scraper.py
│   ├── pubchem/
│   │   ├── __init__.py
│   │   └── pubchem_scraper.py
│   └── medplant/
│       ├── __init__.py
│       └── medplant_scraper.py
├── analysis/
│   ├── __init__.py
│   ├── chembl_statistics.py
│   └── neo4j_csv_validator.py
├── integration/
│   ├── __init__.py
│   └── chembl_imppat_mapper.py
├── utils/
│   ├── __init__.py
│   └── escape_csv_field.py
└── RAG/
    ├── __init__.py
    └── retrieval/
        ├── __init__.py
        ├── docs/
        │   └── retrieval-info.md
        └── (implementation pending)

scripts/
├── chembl_display.py            # CLI tool: display ChemBL data as table
├── csv_display.py               # CLI tool: generic CSV display
├── setup_neo4j.py               # Neo4j setup and import automation
├── cypher_scripts/              # Neo4j import scripts (appropriate here)
└── setup/                       # Environment setup (appropriate here)

data/
├── raw/                         # Original fetched data (do not modify)
├── processed/                   # Neo4j-ready CSV files
├── interim/                     # Intermediate processing files
└── analysis/                    # Analysis output (e.g., candidate lists)

docs/
├── databases/                   # Database-specific reference documentation
├── setup/                       # Environment and tool setup guides
├── notes/                       # Research notes and implementation plans
├── investigations/              # Architecture decisions and option analyses
├── next-steps/                  # Roadmap and gap analysis
├── todos/                       # Task tracking
└── cleanup/                     # Cleanup session records

Where New Code Goes — Decision Tree

Is it a data collection script that talks to an external API/database?
  → src/scrapers/<database_name>/
    └── Follow the chembl/ pattern: __init__.py + main scraper file

Is it analyzing data or producing insights from existing data?
  → src/analysis/

Is it linking data across multiple databases?
  → src/integration/

Is it part of the RAG/LLM system?
  → src/RAG/ (for retrieval components)
  → src/llm/ (for LLM interaction components)

Is it a shared utility used by multiple modules?
  → src/utils/

Is it a standalone CLI tool for human use (not part of the pipeline)?
  → scripts/

Is it a Neo4j import/schema script?
  → scripts/cypher_scripts/

Is it environment setup or DevOps?
  → scripts/setup/

Is it documentation?
  → docs/<appropriate_subdirectory>/

The ChemBL Pattern (Reference Implementation)

When creating a new scraper module, follow src/scrapers/chembl/:

src/scrapers/<database_name>/
├── __init__.py
└── <database_name>_scraper.py    # Main scraper with CLI args (--test, --full, etc.)

For Module Dependencies

Direction: modules depend inward (toward utils), never sideways

src/utils/          ← depended on by everything (shared utilities)
src/scrapers/*      ← independent of each other, may use utils
src/integration/    ← may depend on scrapers' output format knowledge
src/analysis/       ← may depend on integration output, utils
src/RAG/            ← depends on analysis, utils
src/llm/            ← depends on RAG retrieval, utils
src/reporting/      ← depends on analysis output

Scrapers should NEVER depend on each other. Integration modules bridge between scraper outputs. Analysis modules consume processed data (CSVs), not raw scraper logic.

Use the text that follows this command as the specific architecture question or file placement decision to address: