textbook-vectorization - SKILL.md Agent Skill

name: textbook-vectorization description: Vectorize PDF textbooks for semantic search. Use when (1) need to quickly find concepts in textbooks, (2) building personal knowledge base, (3) preparing study materials with searchable content, (4) want to query textbook semantically.

Textbook Vectorization

Objectives

Extract text from PDF textbooks
Generate vector embeddings for semantic search
Enable fast concept lookup across large textbooks
Support both local (free) and cloud-based embedding models

Key Instructions

1. Choose Embedding Model

Local Model (Free):

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # ~80MB, runs locally

Cloud Model (Paid):

from openai import OpenAI
client = OpenAI()
# Use text-embedding-3-small (~$0.0001/1K tokens)

2. Extract and Chunk Text

import pypdf
from pathlib import Path

def extract_text_from_pdf(pdf_path: Path) -> list:
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = pypdf.PdfReader(file)
        for page_num, page in enumerate(reader.pages, 1):
            text = page.extract_text()
            if text.strip():
                pages.append({'page': page_num, 'text': text.strip()})
    return pages

def chunk_text(pages: list, chunk_size: int = 500, overlap: int = 50) -> list:
    chunks = []
    for page_data in pages:
        text = page_data['text']
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end].strip()
            if chunk:
                chunks.append({
                    'page': page_data['page'],
                    'text': chunk
                })
            start = end - overlap
    return chunks

3. Generate Embeddings

Batch processing for efficiency:

def get_embeddings(texts: list, model) -> list:
    batch_size = 32
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch, show_progress_bar=False)
        all_embeddings.extend(embeddings.tolist())
    
    return all_embeddings

4. Save Vector Database

Single JSON file format:

import json

data = {
    'metadata': {
        'total_chunks': len(chunks),
        'embedding_model': 'all-MiniLM-L6-v2',
        'embedding_dim': len(embeddings[0]),
        'chunk_size': 500,
        'chunk_overlap': 50
    },
    'chunks': [
        {
            'id': i,
            'page': chunk['page'],
            'text': chunk['text'],
            'embedding': embedding
        }
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]
}

with open('textbook_vectors.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

5. Query the Vectorized Textbook

import numpy as np

def cosine_similarity(vec1: list, vec2: list) -> float:
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def search(query: str, data: dict, model, top_k: int = 5):
    # Get query embedding
    query_embedding = model.encode([query])[0].tolist()
    
    # Calculate similarities
    results = []
    for chunk in data['chunks']:
        similarity = cosine_similarity(query_embedding, chunk['embedding'])
        results.append((chunk, similarity))
    
    # Sort and return top results
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_k]

Workflow

Initial Setup

Install dependencies:

uv add pypdf sentence-transformers numpy

Prepare PDF textbook in accessible location
Run vectorization script (one-time, ~5-10 minutes for 500-page book)

Query Usage

Interactive mode:

while True:
    query = input("🔍 Query > ").strip()
    if query.lower() in ['quit', 'exit']:
        break
    
    results = search(query, data, model, top_k=3)
    for i, (chunk, score) in enumerate(results, 1):
        print(f"\n[Result {i}] Similarity: {score:.4f} | Page: {chunk['page']}")
        print(chunk['text'][:300] + "...")

Single query:

uv run python query_textbook.py "What is temporal difference learning?"

Configuration

Chunk Size Tuning

Small chunks (200-300 chars): More precise, more chunks, larger file
Medium chunks (500-700 chars): Balanced (recommended)
Large chunks (1000+ chars): Broader context, fewer chunks

Overlap Strategy

No overlap (0): Faster, may miss boundary concepts
Small overlap (50-100): Recommended for most cases
Large overlap (200+): Better coverage, more redundancy

Cost Estimation

Local Model (Free)

Model download: ~80MB (one-time)
Processing time: ~5-10 min for 500 pages
Storage: ~50-100MB JSON file

Cloud Model (OpenAI)

text-embedding-3-small: ~$0.0001/1K tokens
500-page textbook: ~$0.50-$2.00
Faster processing, no local compute needed

Validation

Before using the vectorized textbook:

Verify JSON file size is reasonable (50-200MB for typical textbook)
Test query with known concept from textbook
Check that page numbers are preserved correctly
Ensure embeddings dimension matches model output

Common Issues

Issue: PDF extraction returns gibberish

Solution: PDF may be scanned images. Use OCR first:

uv add pytesseract pdf2image
# Convert to images, then OCR

Issue: Out of memory during embedding

Solution: Reduce batch size:

batch_size = 16  # or even 8

Issue: Query returns irrelevant results

Solution:

Try different query phrasing
Reduce chunk size for more precision
Increase top_k to see more results

Best Practices

Version control: Save vectorization parameters in metadata
Incremental updates: Re-vectorize only changed chapters
Multiple textbooks: Use separate JSON files or add book_id field
Query refinement: Start broad, then narrow down with specific terms
Backup: Keep original PDF and vectors separately

Reference Scripts

See courses/rl/scripts/ for complete implementation:

vectorize_textbook.py - Full vectorization pipeline
query_textbook.py - Interactive and CLI query interface