name: pgvector-setup description: | Supabase pgvector setup, RAG pipeline, and vector search for StartupAI. Covers knowledge_chunks schema, HNSW/IVFFlat indexes, search_knowledge + hybrid_search_knowledge RPCs, embedding generation via OpenAI, Edge Function integration, and Gemini web search grounding.
Trigger when user asks to:
- Set up or modify vector search, embeddings, or knowledge base
- Ingest documents into knowledge_chunks
- Debug search quality or missing results
- Tune HNSW parameters or search performance
- Wire RAG into edge functions or AI chat
- Use Google Search grounding or URL Context with Gemini
This project: OpenAI text-embedding-3-small (1536 dims), stored in knowledge_chunks, HNSW index, search via search_knowledge() and hybrid_search_knowledge() RPCs. allowed-tools: Bash, Read, Write, Edit
pgvector for StartupAI
Our Setup (Live DB — audited 2026-03-19)
| Component | Value |
|---|---|
| Extension | pgvector 0.8.0 |
| Embedding model | OpenAI text-embedding-3-small |
| Dimensions | 1536 |
| Table | knowledge_chunks |
| Index | HNSW (m=16, ef_construction=64, vector_cosine_ops) |
| FTS | GIN index on fts tsvector column |
| Total chunks | 3,973 (3,865 with embeddings) |
| Documents | 89 (all with content_hash dedup) |
| Search RPCs | search_knowledge + hybrid_search_knowledge |
| RLS | Authenticated SELECT; service_role writes |
Industry Distribution (problem: 78% AI/ML)
| Industry | Chunks | With Embedding |
|---|---|---|
| ai_ml | 3,106 | 3,102 |
| fashion | 508 | 508 |
| retail | 219 | 219 |
| GTM | 22 | 22 |
| saas | 36 | 0 |
| 7 others | 1-2 each | 0 |
Gap: 16 of 19 industries have <3 chunks. healthtech, fintech, ecommerce, education, legal, manufacturing, etc. are essentially empty.
Empty Tables
industry_playbooks— 0 rows (edge functions query this and get nothing)prompt_packs— 0 rows (prompt-pack edge function queries this)
Schema
-- knowledge_chunks (main vector table)
CREATE TABLE knowledge_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL CHECK (length(content) >= 20),
embedding vector(1536),
source TEXT,
source_type TEXT,
year INT,
confidence TEXT, -- high/medium/low
category TEXT,
industry TEXT,
document_id UUID REFERENCES knowledge_documents(id),
page_start INT,
page_end INT,
section_title TEXT,
chunk_kind TEXT, -- text/table
chunk_index INT,
source_path TEXT,
tags TEXT[],
fts TSVECTOR, -- auto-maintained via trigger
fetch_count INT DEFAULT 0,
last_fetched_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
-- knowledge_documents (source tracking)
CREATE TABLE knowledge_documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL,
source_type TEXT,
year INT,
llama_parse_id TEXT, -- LlamaCloud job ID for PDFs
content_hash TEXT NOT NULL, -- SHA-256 dedup
created_at TIMESTAMPTZ DEFAULT now()
);
Indexes (12 total)
-- Vector search (HNSW)
CREATE INDEX knowledge_chunks_embedding_idx
ON knowledge_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Full-text search
CREATE INDEX idx_knowledge_chunks_fts ON knowledge_chunks USING gin (fts);
-- Dedup (prevents re-embedding same content)
CREATE UNIQUE INDEX idx_knowledge_chunks_doc_content_hash
ON knowledge_chunks (document_id, md5(content))
WHERE document_id IS NOT NULL;
-- Filter indexes
CREATE INDEX idx_knowledge_chunks_category ON knowledge_chunks (category);
CREATE INDEX idx_knowledge_chunks_industry ON knowledge_chunks (industry) WHERE industry IS NOT NULL;
CREATE INDEX idx_knowledge_chunks_source_type ON knowledge_chunks (source_type);
CREATE INDEX idx_knowledge_chunks_year ON knowledge_chunks (year DESC);
CREATE INDEX idx_knowledge_chunks_category_year ON knowledge_chunks (category, year DESC);
CREATE INDEX idx_knowledge_chunks_tags ON knowledge_chunks USING gin (tags);
CREATE INDEX idx_knowledge_chunks_document_id ON knowledge_chunks (document_id) WHERE document_id IS NOT NULL;
Search Functions
search_knowledge (semantic only)
search_knowledge(
query_embedding vector,
match_threshold float DEFAULT 0.75,
match_count int DEFAULT 5,
filter_category text DEFAULT NULL,
filter_industry text DEFAULT NULL
) RETURNS TABLE(id, content, source, source_type, year, confidence, category, tags, similarity)
hybrid_search_knowledge (semantic + keyword with RRF)
hybrid_search_knowledge(
query_embedding vector,
query_text text,
match_threshold float DEFAULT 0.75,
match_count int DEFAULT 10,
filter_category text DEFAULT NULL,
filter_industry text DEFAULT NULL,
rrf_k int DEFAULT 50
) RETURNS TABLE(id, content, source, source_type, year, confidence, category, tags, similarity,
document_id, document_title, section_title, page_start, page_end)
How RRF works: Runs vector search (top 20) and FTS search (top 20) independently, then combines ranks using 1/(k+rank) formula. Captures both semantic meaning and exact keyword matches.
Calling from Edge Functions
// Via service-role client (bypasses RLS for pipeline callers)
const { data } = await supabaseAdmin.rpc('hybrid_search_knowledge', {
query_embedding: embedding,
query_text: userQuery,
match_threshold: 0.5,
match_count: 10,
filter_industry: 'healthcare',
});
Embedding Generation
File: supabase/functions/_shared/openai-embeddings.ts
// Single embedding
const result = await generateEmbedding(text);
// result.embedding = number[1536]
// Batch (20 at a time, 500ms throttle)
const results = await generateEmbeddings(texts);
Known issue: No Promise.race timeout — OpenAI calls can hang. Add 15s timeout wrapper.
RAG Integration Points
Where RAG is wired today
| Feature | File | How |
|---|---|---|
| AI Chat (coaching modes) | ai-chat/rag.ts |
getRAGContext() → injects top-5 chunks into Gemini system prompt |
| Canvas Coach | lean-canvas-agent/actions/coach.ts |
searchCoachingContext() → service-role RPC → citations as badges |
Where RAG should be wired (not yet done)
| Feature | Integration point | Expected impact |
|---|---|---|
| Validator Composer | validator-start/agents/composer.ts Group D |
Cite benchmarks in executive summary |
| Sprint Agent | sprint-agent/index.ts |
Industry-specific task prioritization |
| Pitch Deck Generation | pitch-deck-agent/actions/generation.ts |
Market data in slides |
| Investor Agent | investor-agent/index.ts |
Stage-specific fundraising benchmarks |
RAG failure pattern (graceful degradation)
try {
const context = await getRAGContext(supabase, query, filterIndustry);
// Inject into system prompt if non-empty
} catch (e) {
console.warn("[RAG] Failed:", e);
return ""; // Fail open — chat continues without RAG
}
Web Search (Not pgvector — Gemini API features)
For live web data, use Gemini's built-in search instead of vectorizing URLs.
Google Search Grounding
const result = await callGeminiChat(model, systemPrompt, messages, {
useSearch: true, // Gemini searches Google in real-time
timeoutMs: 45_000,
});
// Citations in result.citations (extracted from groundingMetadata)
Where used: Validator Research Agent, Competitors Agent, AI Chat research mode.
URL Context (read a specific webpage)
const result = await callGemini(model, systemPrompt, userPrompt, {
useUrlContext: true, // Gemini fetches and reads the URL
});
Where used: profile-import, Research Agent (reads competitor URLs).
When to use which
| Content Type | Method | Why |
|---|---|---|
| Our curated benchmarks/playbooks | pgvector | Static, authoritative, fast |
| Market trends, news, competitor data | Google Search | Always fresh |
| Specific website content | URL Context | Reads actual page |
| PDF reports we've ingested | pgvector | Already chunked + embedded |
| External research links (research/links/) | Google Search / URL Context | Live reads get latest data |
Ingestion Pipeline
Scripts
# Batch ingest all markdown docs
node scripts/ingest-industry-docs-batch.mjs
# Single directory ingest
node scripts/ingest-markdown-knowledge.mjs --dir=research/<path> --industry=<tag>
# Verify vectors after ingest
node scripts/verify-vector-rag.mjs
Edge Function: knowledge-ingest
Auth: X-Internal-Token header (not JWT — used by scripts)
Actions: ingest, ingest_from_llamacloud, generate_embeddings, status, test_search
Chunking strategy:
- Heading-based segmentation (## and ### level)
- Table extraction (entire tables as single chunks, 3600-char max)
- Narrative text: 3600-char chunks, 200-char overlap
- SHA-256 content-hash dedup (unique index prevents re-embedding)
HNSW Tuning Guide
Parameters
| Parameter | Our Value | Range | Effect |
|---|---|---|---|
m |
16 | 12-48 | Higher = better recall, more memory |
ef_construction |
64 | 64-256 | Higher = better graph quality, slower builds |
ef_search |
40 (default) | 40-400 | Higher = better recall, slower queries |
When to change
- Poor recall (missing relevant results): Increase
ef_searchfirst (cheapest). Then increasem+ef_construction(requires reindex). - High latency: Decrease
ef_search. Check if index fits in memory (pg_size_pretty). - Filtered search returns too few results: Enable iterative scan:
SET hnsw.iterative_scan = relaxed_order; SET hnsw.max_scan_tuples = 50000;
Performance at our scale (3,973 chunks, 1536-dim)
At this data volume, HNSW is massive overkill — queries are <50ms. Tuning only matters at 100K+ vectors. Current config is fine.
Monitoring
-- Index size
SELECT pg_size_pretty(pg_relation_size('knowledge_chunks_embedding_idx'));
-- Query plan (verify index is used)
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content FROM knowledge_chunks
ORDER BY embedding <=> $1::vector(1536) LIMIT 10;
-- Check for chunks without embeddings
SELECT count(*) FROM knowledge_chunks WHERE embedding IS NULL;
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| Search returns AI/ML data for non-AI founders | 78% of chunks are ai_ml | Ingest more industry-specific content |
industry_playbooks queries return empty |
0 rows in table | Seed 19 industry playbooks |
prompt_packs queries return empty |
0 rows in table | Seed prompt pack definitions |
| 108 chunks missing embeddings | saas + null-industry chunks | Run generate_embeddings action |
| Hybrid search not finding keyword matches | FTS query syntax edge cases | Use websearch_to_tsquery (handles phrases, AND/OR) |
| Search quality low for non-English content | text-embedding-3-small optimized for English | Consider multilingual model if needed |
Priority Improvements
P0 — Fill empty industries:
- Ingest
research/services/02-12*.md(11 industry guides) - Ingest
research/sports/reports/*.md(5 synthesis reports) - Seed
industry_playbookstable (19 rows)
P1 — Wire RAG into more features:
- Add
getRAGContext()to Validator Composer (Group D) - Add
getRAGContext()to Sprint Agent - Add
Promise.race(15s)toopenai-embeddings.ts
P2 — Expand knowledge base:
- Ingest remaining PDFs via LlamaCloud
- Generate embeddings for 108 missing chunks
- Seed
prompt_packstable
Docs: https://supabase.com/docs/guides/ai Version: 2.0.0 Last Updated: 2026-03-19