code-research - SKILL.md Agent Skill

name: code-research
description: Code Research Agent. Use when relevant to this domain.
domain: agents

Code Research Agent

Autonomous codebase analysis agent that produces a structured understanding of unfamiliar code: architecture, data flows, dependencies, conventions, and entry points. The agent reads code systematically: it follows a deliberate investigation protocol rather than browsing files at random.

When to Use

Joining a new project and needing to understand the codebase
Investigating how a feature is implemented across multiple files
Tracing a data flow from input to output
Understanding dependency chains before making changes
Finding all callers of a function or users of a module
Mapping the architecture of a monolith or microservice
Preparing a technical design document that references existing code

When NOT to Use

Implementing new features (use code-agent)
Reviewing code quality (use review-agent)
Refactoring code (use refactor-agent)
Writing tests (use test-agent)
Researching external documentation (use web-research)
Analyzing market or competitors (use market-research-agent)
Code is trivially simple (single file, obvious structure)
You already know where everything is
Real-time debugging (use systematic-debugging)

Process / Steps

Follow these steps in order. Each step builds on the previous one.

1. Topology Scan (Map the Terrain)

Start broad, narrow down:

# Manifest and config files (up to 2 levels deep)
# The parentheses make -type f apply to every -name test, not only the first
find . -maxdepth 2 -type f \( -name "*.json" -o -name "*.toml" -o -name "*.yaml" \) | head -20

# Entry points
ls -la src/main.* src/index.* src/app.* 2>/dev/null

# Configuration files (run whichever exists for the project)
jq '{name, dependencies, scripts}' package.json
grep -A 50 '\[tool\|dependencies' pyproject.toml
head -40 Cargo.toml

# Directory purpose map
for dir in src/*/; do
    echo "=== $dir ==="
    ls "$dir" | head -10
done

Produce a topology map:

## Project Topology
- **Language**: [language + version]
- **Framework**: [framework + version]
- **Entry point**: [file]
- **Key directories**:
  - `src/core/` - [business logic]
  - `src/api/` - [HTTP endpoints]
  - `src/db/` - [database layer]
  - `src/utils/` - [shared utilities]
  - `tests/` - [test suite]
- **Config files**: [list with purpose]
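
As a rough illustration of how the topology scan can feed the **Language** line of the map, the sketch below guesses the ecosystem from manifest file names. The `MANIFESTS` table and the `detect_stacks` helper are hypothetical names introduced here, and the mapping is deliberately incomplete; treat it as a starting point, not a definitive detector.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from manifest file name to ecosystem; extend as needed.
MANIFESTS = {
    "package.json": "JavaScript/TypeScript (npm)",
    "pyproject.toml": "Python",
    "requirements.txt": "Python",
    "Cargo.toml": "Rust (Cargo)",
    "go.mod": "Go",
    "pom.xml": "Java (Maven)",
}


def detect_stacks(paths):
    """Return the sorted set of ecosystems suggested by manifest file names."""
    found = set()
    for path in paths:
        name = PurePosixPath(path).name  # ignore the directory part
        if name in MANIFESTS:
            found.add(MANIFESTS[name])
    return sorted(found)
```

Feeding it the output of the `find` command above (one path per line) gives a first guess that still needs to be confirmed against the entry points.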

2. Dependency Analysis (What Does It Use)

# Direct dependencies
jq '.dependencies' package.json
# OR
grep -r "import " src/ | sed 's/.*import //' | sort | uniq -c | sort -rn | head -20

# Internal module graph
grep -r "from \." src/ | sed 's/.*from //' | sort | uniq -c | sort -rn | head -20

# External API calls
grep -r "fetch\|axios\|requests\.\|http\." src/ | head -20

# Database access patterns
grep -r "SELECT\|INSERT\|UPDATE\|DELETE\|\.query\|\.find\|\.create" src/ | head -20
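
The `grep` pipelines above are text matches, so they also count imports that appear in comments or strings. For Python sources, a parser-based count is one possible refinement. The sketch below assumes the sources are already loaded into a dict; `count_imports` is a hypothetical helper, not part of this skill.

```python
import ast
from collections import Counter


def count_imports(sources):
    """Count top-level packages imported across {filename: source} pairs."""
    counts = Counter()
    for filename, text in sources.items():
        try:
            tree = ast.parse(text, filename=filename)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    counts[alias.name.split(".")[0]] += 1
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # level == 0 means an absolute import; relative ones are internal
                counts[node.module.split(".")[0]] += 1
    return counts
```

Relative imports are skipped on purpose so the result reflects external and absolute dependencies only; the internal module graph is a separate question.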

3. Data Flow Tracing

Follow data from input to output:

## Data Flow: [Feature Name]
1. **Entry**: [HTTP request / CLI input / event]
2. **Validation**: [file:line] -- [what is validated]
3. **Processing**: [file:line] -- [what transformation]
4. **Storage**: [file:line] -- [what database/filesystem]
5. **Response**: [file:line] -- [what is returned]
6. **Side effects**: [file:line] -- [events fired, emails sent, etc.]

### Flow Diagram

[Client]
  -> [Router: src/api/routes.ts:42]
  -> [Service: src/services/user.ts:15]
  -> [Repository: src/db/users.ts:30]
  -> [PostgreSQL: users table]
  <- [Response: {id, name, email}]
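
One way to keep a trace honest is to record each step with its `file:line` as you read the code, then render the numbered list from those records. The sketch below is a minimal illustration; `FlowStep` and `render_flow` are hypothetical names, and the output follows the data-flow template above.

```python
from dataclasses import dataclass


@dataclass
class FlowStep:
    stage: str      # e.g. "Validation"
    location: str   # "file:line" taken from the code you actually read
    note: str       # what happens at this step


def render_flow(feature, steps):
    """Render steps as the numbered list used by the data-flow template."""
    lines = [f"## Data Flow: {feature}"]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. **{step.stage}**: {step.location} -- {step.note}")
    return "\n".join(lines)
```

Because every step requires a location, a step you have not actually found in the code is hard to add by accident.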

4. Convention Discovery

Identify patterns the codebase follows:

## Code Conventions

Document naming, imports, error handling, testing, and state-management patterns.

### Naming
- Variables: [camelCase / snake_case / PascalCase]
- Functions: [camelCase / snake_case]
- Classes: [PascalCase]
- Constants: [UPPER_SNAKE_CASE]
- Files: [kebab-case / PascalCase / snake_case]

### Import Style
- [Named imports / default imports / wildcard]
- [Grouped: stdlib, external, internal]

### Error Handling
- [Try/catch / Result types / Error middleware]
- [Custom error classes / plain Error / HTTP errors]

### Testing
- [Framework]: [jest / pytest / go test]
- [File naming]: [*.test.ts / test_*.py / *_test.go]
- [Pattern]: [describe/it / class-based / flat functions]
- [Mocking]: [jest.mock / unittest.mock / testdouble]

### State Management
- [Where state lives]: [in-memory / database / cache / message queue]
- [How state flows]: [request-scoped / singleton / event-driven]
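
To back the naming section with evidence rather than impressions, you can tally identifier styles from a sample of names. The sketch below is a heuristic under stated assumptions: the regexes in `STYLES` and the `classify` and `tally_styles` helpers are illustrative, and single lowercase words are reported as `other` because they are ambiguous.

```python
import re
from collections import Counter

# Regexes for common identifier styles; checked in this order.
STYLES = {
    "camelCase": re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$"),
    "snake_case": re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$"),
    "PascalCase": re.compile(r"^(?:[A-Z][a-z0-9]+)+$"),
    "UPPER_SNAKE_CASE": re.compile(r"^[A-Z0-9]+(?:_[A-Z0-9]+)+$"),
}


def classify(name):
    """Return the first style whose pattern matches, or 'other'."""
    for style, pattern in STYLES.items():
        if pattern.match(name):
            return style
    return "other"


def tally_styles(names):
    """Count how many identifiers fall into each naming style."""
    return Counter(classify(name) for name in names)
```

A clear majority in the tally is the convention to record; a near-even split is itself a finding worth noting under Risks.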

5. Hotspot Detection

Find the most-changed, most-complex, most-coupled code:

# Most-changed files (likely have bugs or need refactoring)
# grep -v '^$' drops the blank separator lines so they are not counted as a "file"
git log --pretty=format: --name-only | grep -v '^$' | sort | uniq -c | sort -rn | head -20

# Most complex (by function count per file)
# -r with --include avoids relying on bash globstar for src/**
grep -rc --include="*.py" --include="*.ts" "def \|function \|func " src/ 2>/dev/null | sort -t: -k2 -rn | head -20

# Most imported (highest fan-in = most depended upon)
grep -r "from.*import\|require(" src/ | sed 's/.*from //;s/.*require(//' | sort | uniq -c | sort -rn | head -20

# Largest files (by line count)
find src -type f \( -name "*.ts" -o -name "*.py" \) -exec wc -l {} + 2>/dev/null | sort -rn | head -20
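
If you prefer to post-process the churn data in a script, the ranking step can be sketched as below. It assumes the input is the raw text printed by `git log --pretty=format: --name-only`; `rank_churn` is a hypothetical helper name.

```python
from collections import Counter


def rank_churn(log_output, top=20):
    """Rank files by how many commits touched them.

    Expects the text produced by `git log --pretty=format: --name-only`.
    """
    paths = (line.strip() for line in log_output.splitlines())
    counts = Counter(path for path in paths if path)  # drop blank separators
    return counts.most_common(top)
```

The log text can be captured with `subprocess.run([...], capture_output=True, text=True).stdout`. Churn alone does not prove a file is problematic, so cross-check the top entries against the complexity and fan-in lists.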

6. Architecture Diagram

Produce a visual representation:

## Architecture Diagram

Produce a visual representation of the system architecture.


### System Context

[External Users] --> [Web App (Next.js)]
[Web App] --> [API Server (Express)]
[API Server] --> [PostgreSQL]
[API Server] --> [Redis Cache]
[API Server] --> [External API (Stripe)]
[API Server] --> [Message Queue (Bull)]
[Worker] --> [Message Queue]
[Worker] --> [S3 Storage]


### Module Dependency Graph

src/api/ --> src/services/
src/services/ --> src/repositories/
src/services/ --> src/utils/
src/repositories/ --> src/models/
src/models/ --> (ORM)
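
A module dependency graph like the one above can be derived from file-level imports by collapsing each file to its directory. The sketch below assumes you have already resolved imports to repo-relative paths; `module_edges` and `render_edges` are hypothetical helpers, and `depth=2` assumes modules live two directories deep (for example `src/api/`).

```python
from pathlib import PurePosixPath


def module_edges(file_imports, depth=2):
    """Collapse file-level imports into directory-level edges.

    file_imports maps an importing file to the files it imports, both as
    repo-relative paths. depth is how many leading directories name a module.
    """
    def module_of(path):
        parts = PurePosixPath(path).parent.parts
        return "/".join(parts[:depth]) + "/" if parts else "./"

    edges = set()
    for source, targets in file_imports.items():
        for target in targets:
            src_mod, dst_mod = module_of(source), module_of(target)
            if src_mod != dst_mod:  # ignore imports inside one module
                edges.add((src_mod, dst_mod))
    return sorted(edges)


def render_edges(edges):
    """Render edges one per line in the 'a --> b' style used above."""
    return "\n".join(f"{src} --> {dst}" for src, dst in edges)
```

An edge that points "upward" (for example from a repository module back to a service module) is a quick signal of a layering violation worth listing under Risks.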

7. Finding Specific Code Patterns

# Find all API endpoints (prints the method and route, e.g. app.get('/users')
grep -rhoE "app\.(get|post|put|delete|patch)\([^,)]*" src/ | sort

# Find all database models
grep -r "class.*Model\|@Entity\|@Table\|Schema(" src/ | head -20

# Find all environment variable usage
grep -r "process\.env\.\|os\.environ\|os\.getenv" src/ | sort | uniq

# Find all TODO/FIXME/HACK
grep -r "TODO\|FIXME\|HACK\|XXX" src/ | head -20

# Find all authentication/authorization checks ("auth" also matches authenticate and authorize)
grep -ri "auth\|permission\|role\|middleware" src/ | head -20
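
The environment-variable search above returns whole lines. When you want just the variable names (for a config inventory, say), a regex pass such as the following can help. It is a sketch: `ENV_PATTERNS` and `env_var_names` are hypothetical, and the patterns cover only the three access styles the `grep` command looks for, with string-literal keys.

```python
import re

ENV_PATTERNS = [
    # Node.js: process.env.NAME
    re.compile(r"process\.env\.([A-Za-z_][A-Za-z0-9_]*)"),
    # Python: os.environ["NAME"]
    re.compile(r"os\.environ\[\s*['\"]([^'\"]+)['\"]\s*\]"),
    # Python: os.getenv("NAME") and os.environ.get("NAME")
    re.compile(r"os\.(?:getenv|environ\.get)\(\s*['\"]([^'\"]+)['\"]"),
]


def env_var_names(text):
    """Return the sorted, de-duplicated variable names referenced in text."""
    names = set()
    for pattern in ENV_PATTERNS:
        names.update(pattern.findall(text))
    return sorted(names)
```

Names built dynamically (for example from string concatenation) will not be found this way and need a manual look.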

Output Format

## Codebase Analysis Report

Structured output template for the codebase analysis.


### Overview
- **Project**: [name]
- **Purpose**: [one sentence]
- **Stack**: [language, framework, database, key dependencies]
- **Size**: [files, lines of code, modules]

### Architecture
- **Pattern**: [MVC / hexagonal / microservices / monolith / serverless]
- **Entry point**: [file:line]
- **Key modules**: [list with one-line descriptions]

### Data Flow
- **Primary flow**: [input -> processing -> storage -> output]
- **Key data models**: [list]
- **External integrations**: [list]

### Conventions
- [Naming, imports, error handling, testing patterns]

### Hotspots
- **Most complex**: [file] -- [why]
- **Most changed**: [file] -- [why]
- **Highest coupling**: [file] -- [why]

### Risks
- [Technical debt, missing tests, security concerns]

### Recommendations
- [Actionable findings for the team]

Common Rationalizations

| Rationalization | Reality |
|---|---|
| "I will just read the README" | READMEs describe intent, not reality. The actual code tells the truth. Read both, trust the code. |
| "I know enough, start coding" | Premature coding on unfamiliar codebases produces wrong abstractions. Invest 30 minutes in mapping first. |
| "This codebase is too large to understand" | You do not need to understand everything. Map the architecture, trace the relevant data flow, and understand the conventions. That is enough. |
| "The code is self-documenting" | Self-documenting code tells you what it does, not why. Comments, commit messages, and architecture docs tell you why. |
| "Just grep for what I need" | Grepping without understanding architecture gives fragments. You need the map before you can use the fragments. |

Red Flags

Making changes without understanding the module's dependencies
Ignoring existing conventions (naming, error handling, patterns)
Assuming the codebase structure based on another project
Not checking what tests exist before changing code
Tracing only the happy path and missing error handling flows
Not identifying the entry points and boundary conditions

Verification

After code research, confirm:

Architecture pattern identified (MVC, hexagonal, etc.)
Entry points mapped (where does execution begin)
Key dependencies listed (what does this project depend on)
Data flow traced for the relevant feature (input to output)
Code conventions documented (naming, imports, error handling)
Hotspots identified (complex, changed, coupled code)
Test patterns understood (framework, naming, coverage)
Output is structured enough that another agent could use it
No assumptions made without evidence from the actual code
No [TODO] or placeholder content in the analysis
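
The last check in the list can be partly automated. The sketch below flags bracketed spans that look like unfilled template slots, such as `[name]` or `[TODO]`. It is a crude heuristic with a hypothetical helper name (`find_placeholders`): Markdown links are skipped, but legitimate bracketed text, for example the nodes of a flow diagram, will also be flagged and needs a manual look.

```python
import re

# A bracketed span that is not a Markdown link label, e.g. "[name]" or "[TODO]".
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\](?!\()")


def find_placeholders(report):
    """Return (line_number, placeholder) pairs for possible unfilled slots."""
    hits = []
    for number, line in enumerate(report.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

An empty result does not prove the report is complete; it only rules out the most obvious leftover template text.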