name: vectorize description: "Manage codebase and database vectorization for semantic search. Use when initializing, refreshing, or querying the vector index. Triggers on: vectorize init, vectorize refresh, vectorize search, semantic search, vector index, enable vectorization."
Vectorize Skill
Manage codebase and database vectorization for semantic search capabilities.
Overview
Vectorization enables agents to query project knowledge semantically instead of relying solely on grep/glob. This skill provides CLI commands to initialize, refresh, and query the vector index.
Benefits
- Semantic search: Ask "How does authentication work?" instead of grep for "auth"
- 49% fewer retrieval failures with Contextual Retrieval
- Hybrid search: Combines semantic understanding with keyword matching
- Database awareness: Agents understand your schema and config tables
Requirements
- Embedding API Key (one of):
OPENAI_API_KEY— For OpenAI embeddings (text-embedding-3-small)VOYAGE_API_KEY— For Voyage AI embeddings (voyage-code-3) ⭐ Recommended for code
- ANTHROPIC_API_KEY: Required for Contextual Retrieval (Claude Haiku)
- DATABASE_URL: Optional, for database schema indexing
Commands
vectorize init
Initialize vectorization for the current project.
# From project root
vectorize init
What it does:
- Checks for required API keys in environment
- Adds
vectorizationsection toproject.json - Creates
.vectorindex/directory (gitignored) - Scans codebase and creates initial index
- Installs git post-commit hook for automatic updates
- Optionally indexes database schema
Output:
Initializing vectorization for my-project...
Detected stack: Next.js + TypeScript + Supabase
Found 1,247 source files
Configuration:
Embedding model: Voyage AI voyage-code-3
Contextual retrieval: enabled
Storage: local (.vectorindex/)
Database detected:
DATABASE_URL found in environment
Type: PostgreSQL (Supabase)
Include schema indexing? (y/n): y
Building index...
Chunking: 1,247 files → 8,453 chunks
Contextual: Adding descriptions (Claude Haiku)
Embedding: 8,453 chunks → vectors
[████████████████████] 100%
Installing git hooks...
post-commit hook installed
✅ Vectorization ready!
Index: 8,453 chunks (42MB)
Cost: $2.34 (one-time)
Next steps:
• Agents will automatically use semantic search
• Run 'vectorize search <query>' to test
• Run 'vectorize status' to check index health
vectorize refresh
Rebuild the vector index (full or incremental).
# Incremental refresh (only changed files)
vectorize refresh
# Full rebuild
vectorize refresh --full
When to use:
- After major refactoring
- If index seems stale or corrupted
- After adding database config tables
vectorize status
Show index statistics and health.
vectorize status
Output:
Vector Index Status: my-project
Index Location: .vectorindex/
Last Updated: 2026-02-28 10:30:45 (2 hours ago)
Index Age: OK (within 24h threshold)
Codebase:
Files indexed: 1,247
Chunks: 8,453
Languages: TypeScript (1,102), JavaScript (89), Markdown (56)
Database:
Schema: 23 tables, 187 columns
Config tables: pricing_tiers (10 rows), feature_flags (15 rows)
Storage:
Vector index: 38MB
BM25 index: 4MB
Total: 42MB
Configuration:
Embedding model: voyage (voyage-code-3)
Contextual retrieval: enabled
Hybrid weight: 0.7 (semantic)
Top-K: 20
vectorize search <query>
Test semantic search from the command line.
vectorize search "How does user authentication work?"
Output:
Found 8 relevant chunks for "How does user authentication work?"
1. src/auth/middleware.ts (lines 45-89) [score: 0.94]
┌─────────────────────────────────────────────────────────────────
│ // JWT verification middleware
│ export async function verifyAuth(req: Request) {
│ const token = req.headers.get('Authorization')?.replace('Bearer ', '');
│ if (!token) throw new AuthError('Missing token');
│
│ const payload = await verifyJWT(token, process.env.JWT_SECRET);
│ return { userId: payload.sub, role: payload.role };
│ }
└─────────────────────────────────────────────────────────────────
2. src/auth/providers/supabase.ts (lines 12-67) [score: 0.91]
┌─────────────────────────────────────────────────────────────────
│ // Supabase auth provider implementation
│ export const supabaseAuth = {
│ signIn: async (email: string, password: string) => {
│ const { data, error } = await supabase.auth.signInWithPassword({
│ email, password
│ });
│ ...
└─────────────────────────────────────────────────────────────────
3. docs/ARCHITECTURE.md (lines 156-180) [score: 0.87]
┌─────────────────────────────────────────────────────────────────
│ ## Authentication Design
│
│ We use Supabase Auth with JWT tokens. The flow:
│ 1. User signs in via Supabase
│ 2. Frontend stores access token
│ 3. API routes verify via middleware
│ ...
└─────────────────────────────────────────────────────────────────
[5 more results...]
vectorize config
Show current vectorization settings.
vectorize config
Implementation
Directory Structure
<project>/
├── .vectorindex/ # Gitignored
│ ├── codebase.lance/ # LanceDB table for code embeddings
│ ├── database.lance/ # LanceDB table for schema/config embeddings
│ ├── bm25/ # BM25 keyword index
│ ├── metadata.json # Index state, timestamps, chunk count
│ └── contexts/ # Cached contextual descriptions
├── docs/
│ └── project.json # Contains vectorization config
Embedding Models
| Provider | Model | Best For | Env Var |
|---|---|---|---|
| Voyage AI | voyage-code-3 |
Code retrieval ⭐ | VOYAGE_API_KEY |
| Voyage AI | voyage-3.5 |
General purpose | VOYAGE_API_KEY |
| Voyage AI | voyage-3.5-lite |
Low latency/cost | VOYAGE_API_KEY |
| OpenAI | text-embedding-3-small |
General purpose | OPENAI_API_KEY |
| OpenAI | text-embedding-3-large |
Higher quality | OPENAI_API_KEY |
| Ollama | Local models | Free, offline | None (local) |
Recommendation: Use voyage-code-3 for code search. It's specifically optimized for code retrieval and is recommended by Anthropic.
Token-Aware Batching
When using Voyage AI embeddings, the system automatically batches chunks to stay within API token limits:
| Limit | Value | Purpose |
|---|---|---|
| Token limit | 50,000 per batch | Stay under Voyage 120k API limit (conservative) |
| Chunk limit | 100 per batch | API batch size limit |
How it works:
- Each chunk's token count is estimated (~2 chars per token for code)
- Chunks are added to a batch until token limit would be exceeded
- Batch is sent to API, next batch starts
- Process repeats until all chunks embedded
Benefits:
- Large codebases handled efficiently
- Optimal API usage (fewer calls, larger batches)
- Automatic — no configuration required
Output modes:
| Mode | Command | Output |
|---|---|---|
| Default | vectorize refresh |
Progress bar + summary ("100 chunks in 3 batches") |
| Verbose | vectorize refresh --verbose |
Per-batch breakdown |
| Quiet | vectorize refresh --quiet |
Errors only (for CI/scripts) |
Default output:
Building index...
[████████████████████] 100%
Total: 8,453 chunks in 85 batches, 85 API calls
Verbose output:
Building index...
Batch 1: 98 chunks, ~49,500 tokens
Batch 2: 97 chunks, ~48,200 tokens
...
Batch 85: 12 chunks, ~5,100 tokens (final)
[████████████████████] 100%
Total: 8,453 chunks in 85 batches, 85 API calls
Configuration in project.json
{
"vectorization": {
"enabled": true,
"storage": "local",
"embeddingModel": "voyage-code-3",
"contextualRetrieval": "auto",
"codebase": {
"include": ["src/**", "lib/**", "docs/**"],
"exclude": ["node_modules/**", "dist/**", "*.test.ts"],
"chunkStrategy": "ast"
},
"database": {
"enabled": true,
"connection": "env:DATABASE_URL",
"type": "postgres",
"schema": {
"include": ["public.*"],
"exclude": ["public.migrations"]
},
"configTables": [
{
"table": "public.pricing_tiers",
"description": "Subscription pricing and feature limits",
"sampleRows": 10
}
]
},
"search": {
"hybridWeight": 0.7,
"topK": 20,
"reranking": {
"enabled": false,
"model": "cross-encoder"
}
},
"refresh": {
"onGitChange": true,
"onSessionStart": true,
"maxAge": "24h"
},
"credentials": {
"voyage": "env:VOYAGE_API_KEY",
"openai": "env:OPENAI_API_KEY",
"anthropic": "env:ANTHROPIC_API_KEY"
}
}
}
Agent Integration
semantic_search Tool
When vectorization is enabled, agents have access to a semantic_search tool:
// Tool signature
semantic_search({
query: string, // Natural language query
filters?: {
filePatterns?: string[], // e.g., ["src/auth/**", "*.ts"]
languages?: string[], // e.g., ["typescript", "python"]
contentType?: "code" | "schema" | "config" | "docs"
},
topK?: number // Override default (20)
})
// Returns
{
results: [
{
content: string, // Chunk content
filePath: string, // e.g., "src/auth/middleware.ts"
lineRange: [45, 89], // Start and end lines
language: string, // e.g., "typescript"
score: number, // Relevance score (0-1)
type: "code" | "schema" | "config" | "docs"
}
],
indexAge: string, // e.g., "2 hours ago"
queryTime: number // Milliseconds
}
Agent Usage
Agents automatically use semantic search when:
vectorization.enabled: truein project.json- Index exists in
.vectorindex/ - Index is not stale (within
maxAge)
Example agent prompt usage:
// @builder looking for authentication patterns
Before implementing the auth feature, let me search for existing patterns:
semantic_search("How is authentication implemented?")
→ Found middleware in src/auth/middleware.ts
→ Found provider in src/auth/providers/supabase.ts
→ Found architecture docs explaining the flow
Now I can implement consistent with existing patterns.
Chunking Strategy
AST Chunking (Default)
Uses Tree-sitter for language-aware chunking:
- TypeScript/JavaScript: Functions, classes, methods, exports
- Python: Functions, classes, methods, modules
- Go: Functions, methods, structs, interfaces
- Rust: Functions, impls, structs, enums
- Java: Classes, methods, interfaces
Chunks respect semantic boundaries. Large functions (>500 tokens) are split with overlap.
Sliding Window Fallback
For unsupported languages or config files:
- Window size: 256 tokens
- Overlap: 50 tokens
- Preserves context across chunk boundaries
Markdown/Docs Chunking
- Section-based chunking (by headings)
- Preserves heading hierarchy in context
- Code blocks kept intact
Contextual Retrieval
When enabled, each chunk is enriched with a brief contextual description before embedding.
How it works:
- Read the full source file
- For each chunk, ask Claude Haiku: "Given this file, describe what this chunk does in 50-100 tokens"
- Prepend the description to the chunk
- Embed the enriched chunk
Example:
Original chunk:
export async function verifyAuth(req: Request) {
const token = req.headers.get('Authorization')?.replace('Bearer ', '');
if (!token) throw new AuthError('Missing token');
return verifyJWT(token, process.env.JWT_SECRET);
}
With context:
[This function is the main authentication middleware in the auth module. It extracts
the JWT token from the Authorization header and verifies it using the JWT_SECRET
environment variable. It's used by all protected API routes.]
export async function verifyAuth(req: Request) {
const token = req.headers.get('Authorization')?.replace('Bearer ', '');
if (!token) throw new AuthError('Missing token');
return verifyJWT(token, process.env.JWT_SECRET);
}
Benefits:
- 49% fewer retrieval failures (per Anthropic research)
- Better understanding of chunk's role in codebase
- Improved semantic matching
Cost:
- ~$5 per 10k files (one-time)
- Uses prompt caching to reduce costs
- Only reruns for changed files
Database Indexing
Schema Extraction
Extracts and indexes:
- Table names and descriptions
- Column names, types, and constraints
- Foreign key relationships
- Indexes
- Table/column comments
Example indexed content:
Table: public.users
Description: Application users and their profiles
Columns:
- id: uuid (primary key, default: gen_random_uuid())
- email: text (unique, not null)
- password_hash: text (not null)
- full_name: text
- role: text (default: 'user', check: role in ('user', 'admin', 'moderator'))
- created_at: timestamptz (default: now())
- updated_at: timestamptz
Foreign keys:
- organization_id → organizations(id)
Indexes:
- users_email_idx on (email)
- users_org_idx on (organization_id)
Config Table Extraction
For designated config tables, extracts sample rows:
Table: public.pricing_tiers
Description: Subscription pricing and feature limits
Sample rows:
| name | price_monthly | price_yearly | max_users | features |
|------------|---------------|--------------|-----------|---------------------|
| Free | 0 | 0 | 1 | ["basic"] |
| Pro | 29 | 290 | 5 | ["basic", "api"] |
| Enterprise | 99 | 990 | unlimited | ["basic", "api", …] |
Git Integration
Post-commit Hook
Installed automatically by vectorize init:
#!/bin/sh
# .git/hooks/post-commit
# Get changed files
CHANGED_FILES=$(git diff-tree --no-commit-id --name-only -r HEAD)
# Run incremental vectorize
if [ -d ".vectorindex" ]; then
npx vectorize refresh --incremental --files "$CHANGED_FILES"
fi
Session Start Check
When an agent session starts:
- Check if
.vectorindex/metadata.jsonexists - Compare
lastUpdatedtimestamp with current time - If older than
maxAge(default 24h), prompt for refresh - Compare with
git log HEADto detect missed commits
Cost Estimates
| Codebase Size | Files | Chunks | Embedding Cost | Contextual Cost | Total |
|---|---|---|---|---|---|
| Small | 500 | 3k | ~$0.01 | ~$1.50 | ~$1.51 |
| Medium | 2k | 12k | ~$0.02 | ~$6.00 | ~$6.02 |
| Large | 10k | 60k | ~$0.10 | ~$30.00 | ~$30.10 |
- Costs are one-time for initial indexing
- Incremental updates cost ~1% of full index per commit
- Contextual retrieval can be disabled to reduce costs
Troubleshooting
"VOYAGE_API_KEY not found"
Get an API key from Voyage AI and set it:
export VOYAGE_API_KEY=pa-...
Or add to your shell profile (~/.zshrc, ~/.bashrc).
"OPENAI_API_KEY not found"
Set the environment variable:
export OPENAI_API_KEY=sk-...
Or add to your shell profile (~/.zshrc, ~/.bashrc).
"Index is stale"
Run refresh:
vectorize refresh
"No results for my query"
- Check if file is included in
codebase.includepatterns - Try different query phrasing
- Use
vectorize searchto test different queries - Check
vectorize statusfor index health
"High embedding costs"
- Use Voyage lite: Set
embeddingModel: "voyage-3.5-lite"for lower cost - Disable contextual retrieval: Set
contextualRetrieval: "never" - Use local Ollama: Set
embeddingModel: "ollama"(free) - Reduce include patterns to essential directories
Best Practices
- Include documentation: Add
docs/**to include patterns - Exclude generated code: Add
dist/**,build/**,.next/** - Exclude tests initially: Add
*.test.tsto reduce noise - Use config tables: Designate reference data tables for agent context
- Keep index fresh: Enable
onGitChangehook - Review costs first: Run
vectorize init --dry-runto see estimates