efficient-table-retrieval-understanding - SKILL.md Agent Skill

name: "efficient-table-retrieval-understanding"
description: |
  Build TabRAG-style pipelines that retrieve relevant tables from large image collections and answer natural language queries over them using multimodal LLMs. Implements a three-stage retrieve-rerank-reason architecture for table question answering at scale.

  Trigger phrases:
  - "find the right table and answer my question"
  - "search across table images to answer a query"
  - "build a table retrieval pipeline"
  - "RAG over table images"
  - "table QA from document scans"
  - "retrieve and reason over tabular data"

Efficient Table Retrieval and Understanding (TabRAG)

This skill enables Claude to design and implement TabRAG-style pipelines — three-stage systems that (1) retrieve candidate tables from large image collections using visual-text embeddings, (2) rerank candidates with a multimodal LLM for fine-grained relevance scoring, and (3) reason over the top-ranked tables to generate answers. The technique is based on the EACL 2026 paper by Xu et al. and is specifically designed for scenarios where the relevant table is not known in advance and must be found from thousands of table images.

When to Use

When a user needs to answer natural language questions against a large collection of table images (financial reports, scanned documents, handwritten records)
When building a retrieval-augmented generation (RAG) system specifically for tabular data stored as images rather than structured text
When the user wants to search for the correct table among thousands before performing QA, fact verification, or text generation
When integrating table understanding into a document processing pipeline that handles PDFs, scans, or screenshots containing tables
When the user asks to compare or benchmark different table retrieval strategies (CLIP vs. fine-tuned visual-text models vs. OCR-based approaches)
When building an end-to-end pipeline that goes from "I have 50,000 table images" to "answer this question"

Key Technique

TabRAG decomposes the problem of answering queries over large table collections into three stages with distinct computational profiles. Stage 1 (Retrieval) uses jointly trained visual and text encoders — a vision encoder like LayoutLMv3 for table images and a text encoder like GTE for queries — to compute embeddings and perform fast approximate nearest-neighbor search via FAISS. This runs in ~57ms and filters thousands of tables down to ~10 candidates. The key insight is that layout-aware vision encoders (LayoutLMv3) outperform generic vision encoders (CLIP) because they understand spatial structure inherent in tables. The encoders are trained with contrastive learning (InfoNCE loss) to maximize cosine similarity between matched query-table pairs.
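
The contrastive objective described above can be illustrated with a small, dependency-free sketch. It assumes in-batch negatives, a query-to-table direction only, and a temperature of 0.05; none of these details are stated in this document, so treat them as placeholders. A real training loop would use a tensor library rather than plain lists.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce(query_embs, table_embs, temperature=0.05):
    """Mean query-to-table InfoNCE loss with in-batch negatives.

    query_embs[i] and table_embs[i] form a matched pair; every other table
    in the batch acts as a negative for query i. The temperature value is
    an assumption, not a figure from the paper.
    """
    queries = [l2_normalize(v) for v in query_embs]
    tables = [l2_normalize(v) for v in table_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of query i against every table, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in tables]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the matched pair
    return total / len(queries)

# Matched pairs point in the same direction: the loss is close to zero.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Pairs are swapped: each query is closest to the wrong table, so the loss is large.
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Lowering the temperature sharpens the softmax, which penalizes near-miss negatives more strongly; that is one reason hard negatives matter for this loss.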

Stage 2 (Reranking) passes the top-k candidates through a multimodal LLM (e.g., Mistral-7B with a CLIP ViT visual encoder) that performs fine-grained relevance assessment. The MLLM is trained on three complementary tasks: retrieval-augmented QA, binary context ranking (True/False relevance classification), and multi-table relevance identification. During inference, the model outputs the probability of the "True" token as a relevance score. This stage costs ~810ms but provides the critical precision boost (+7.0% recall improvement).
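
As a rough illustration of the scoring step, the sketch below turns a pair of answer-token logits into P(True) and re-sorts candidates by it. Restricting the softmax to just the "True" and "False" tokens is an assumption made here for simplicity (the model's probability could equally be taken over the full vocabulary), and the logit values are invented.

```python
import math

def p_true(logit_true, logit_false):
    """Probability of "True" under a softmax restricted to the two answer tokens."""
    peak = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - peak)
    e_false = math.exp(logit_false - peak)
    return e_true / (e_true + e_false)

def rerank(candidates, top_n=3):
    """Sort (table_id, logit_true, logit_false) triples by descending P(True)."""
    scored = [(table_id, p_true(lt, lf)) for table_id, lt, lf in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical logits for three candidate tables.
ranked = rerank([("t1", 0.2, 1.5), ("t2", 3.0, -1.0), ("t3", 1.0, 1.0)], top_n=2)
```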

Stage 3 (Answer Generation) feeds the top-ranked table images directly into an MLLM alongside the user query — no OCR conversion needed. The model reasons over the visual table representation to produce answers, achieving 6.1% higher accuracy than prior methods. The direct image input avoids OCR errors that plague text-based approaches, especially for complex layouts, merged cells, and handwritten data.

Step-by-Step Workflow

Inventory and preprocess the table collection. Catalog all table images, normalize to a consistent resolution, and strip any surrounding non-table content (headers, footers, page numbers). Store metadata (source document, page number) alongside each image for provenance tracking.
Encode all table images into embeddings. Use a layout-aware vision encoder (LayoutLMv3 or similar) to compute a fixed-size embedding vector for each table image, and L2-normalize each vector so that inner product equals cosine similarity. Store the vectors in a FAISS index with IVF (inverted file) partitioning for sub-linear search. For collections under 100K tables, an exact flat index (IndexFlatIP over the normalized vectors) is sufficient.
Encode the user query into the shared embedding space. Use a text encoder (GTE, or the text tower of your jointly trained model) to map the natural language query into the same vector space as the table embeddings. Preprocess the query by removing formatting instructions (e.g., "Show answer in JSON") that degrade embedding quality.
Retrieve top-k candidate tables via approximate nearest neighbor search. Query the FAISS index with the text embedding and retrieve the top 10-20 candidates ranked by cosine similarity. This stage is fast (~57ms) and acts as a coarse filter.
Rerank candidates with a multimodal LLM. For each candidate table image, construct a prompt: "For the question '{query}', assess whether this table contains relevant information. Answer True or False." Feed the table image and prompt to the MLLM, read the probability the model assigns to the "True" token (a softmax over its output logits), and re-sort candidates by this score. Keep the top 1-3 tables.
Generate the answer using the top-ranked table(s). Construct a final prompt that includes the user query and the top-ranked table image(s) as visual inputs. The MLLM reasons directly over the image — no OCR intermediate step. Use task-appropriate prompting: "Answer the following question based on the table:" for QA, or "Verify whether the following statement is supported by the table:" for fact-checking.
Post-process and validate the answer. For numerical answers, verify units and magnitude against visible table data. For text generation tasks (summaries, descriptions), check that generated entities actually appear in the table. Return the answer along with a confidence indicator based on the reranking score.
Evaluate retrieval and generation quality. Measure retrieval with MRR (Mean Reciprocal Rank) and Recall@k. Measure generation with task-specific metrics: exact-match accuracy for QA/fact verification, BLEU for text generation. Log per-query retrieval rank for debugging.
Iterate on the contrastive training data. Collect hard negatives — tables that are visually similar but contain different data — and retrain the retrieval encoders. The original paper uses batch size 32, learning rate 2e-5 with cosine decay, and Adam optimizer (beta1=0.9, beta2=0.98).
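
The retrieval metrics named in the evaluation step can be computed with a few lines of plain Python. This is a minimal sketch that assumes each query has exactly one gold table; the table IDs are made up.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold table, or 0.0 if it was not retrieved."""
    for rank, table_id in enumerate(ranked_ids, start=1):
        if table_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(runs, k=10):
    """Compute MRR and Recall@k over a list of (ranked_ids, gold_id) pairs."""
    n = len(runs)
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / n
    recall = sum(1 for ranked, gold in runs if gold in ranked[:k]) / n
    return {"mrr": mrr, "recall_at_k": recall}

# Three toy queries: gold at rank 1, gold at rank 3, gold never retrieved.
runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]
metrics = evaluate_retrieval(runs, k=2)
```

Logging the per-query reciprocal rank alongside these aggregates makes it easier to spot which queries the retriever misses entirely and which ones it only ranks too low.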

Concrete Examples

Example 1: Financial Report QA Pipeline

User: "I have 5,000 table images extracted from quarterly SEC filings. Build a system that can answer questions like 'What was Apple's revenue in Q3 2025?'"

Approach:

Encode all 5,000 table images using a LayoutLMv3-based vision encoder, store in a FAISS IndexFlatIP (inner product for cosine similarity on normalized vectors).
For the query "What was Apple's revenue in Q3 2025?", encode with GTE text encoder.
Retrieve top-10 candidate tables by cosine similarity.
Rerank with MLLM: for each candidate, prompt "Does this table contain Apple's quarterly revenue data? Answer True or False." Sort by P(True).
Pass the top-1 table image + query to the MLLM: "Based on this table, what was Apple's revenue in Q3 2025?"

Output:

Retrieved table: SEC_filing_AAPL_10Q_2025Q3_page12.png (rerank score: 0.94)
Answer: Apple's revenue in Q3 2025 was $94.8 billion.
Retrieval rank: 1 (MRR: 1.0)

Example 2: Multi-Table Fact Verification

User: "I need to verify claims against a database of 20,000 Wikipedia tables stored as images. Check: 'The population of Tokyo exceeded 14 million in 2023.'"

Approach:

Index all 20,000 table images in FAISS with layout-aware embeddings.
Encode the claim as a query, retrieve top-10 tables.
Rerank using binary relevance classification — prompt the MLLM with each table image and the claim.
For the top-ranked table, prompt: "Based on this table, is the following statement SUPPORTED or REFUTED: 'The population of Tokyo exceeded 14 million in 2023.'"

Output:

Retrieved table: wiki_tokyo_demographics_table3.png (rerank score: 0.89)
Verdict: SUPPORTED
Evidence: Table shows Tokyo population as 14,094,034 for 2023.

Example 3: Building the Pipeline from Scratch in Python

User: "Show me how to set up the retrieval and reranking stages."

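import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, LayoutLMv3ImageProcessor

# NOTE: the off-the-shelf checkpoints below are stand-ins. Their embedding spaces
# are NOT aligned out of the box; retrieval is only meaningful after the two
# encoders have been jointly fine-tuned with a contrastive objective.
image_processor = LayoutLMv3ImageProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False  # image-only input, no OCR pass
)
vision_encoder = AutoModel.from_pretrained("microsoft/layoutlmv3-base").eval()
text_encoder = AutoModel.from_pretrained("thenlper/gte-base").eval()
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

# table_image_paths: list of table image file paths, assumed to be defined by the caller.

# Stage 1: Build the retrieval index
table_embeddings = []
with torch.no_grad():  # inference only, no gradients needed
    for img_path in table_image_paths:
        img = Image.open(img_path).convert("RGB")
        pixel_values = image_processor(images=img, return_tensors="pt").pixel_values  # resize, normalize
        emb = vision_encoder(pixel_values=pixel_values).last_hidden_state[:, 0, :]  # CLS token
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2 normalize
        table_embeddings.append(emb.numpy())

table_embeddings = np.vstack(table_embeddings).astype("float32")
index = faiss.IndexFlatIP(table_embeddings.shape[1])  # inner product == cosine sim on normalized vecs
index.add(table_embeddings)

# Stage 1: Retrieve candidates
query = "What was the GDP growth rate in 2024?"
query_tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden = text_encoder(**query_tokens).last_hidden_state
# Masked mean pooling, as in the gte-base model card; use whatever pooling
# your fine-tuned text encoder was trained with.
mask = query_tokens["attention_mask"].unsqueeze(-1).float()
query_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

scores, indices = index.search(query_emb.numpy().astype("float32"), k=10)
# FAISS pads with -1 when the index holds fewer than k vectors.
candidate_paths = [table_image_paths[i] for i in indices[0] if i != -1]

# Stage 2: Rerank with MLLM
# mllm_score is a placeholder you must implement: it should return P("True")
# from your multimodal LLM for the given image and prompt.
rerank_scores = []
for path in candidate_paths:
    prompt = f"For the question '{query}', does this table contain relevant information? Answer True or False."
    score = mllm_score(image_path=path, prompt=prompt)  # P(True)
    rerank_scores.append(score)

top_table = candidate_paths[int(np.argmax(rerank_scores))]

# Stage 3: Generate answer
# mllm_generate is likewise a placeholder wrapping your MLLM's generation call.
answer = mllm_generate(
    image_path=top_table,
    prompt=f"Answer the following question based on this table: {query}"
)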

Best Practices

Do: Use layout-aware vision encoders (LayoutLMv3, DiT) over generic ones (CLIP ViT) for table image embeddings. Layout structure is critical signal for tables that generic models miss.
Do: Strip formatting instructions from queries before encoding ("Output as JSON", "Use markdown") — these degrade retrieval quality by adding noise to the embedding.
Do: Train retrieval encoders with hard negatives (tables from the same domain but with different data) to improve discrimination between visually similar tables.
Do: Feed table images directly to the MLLM rather than converting to text via OCR. Direct visual input preserves spatial relationships and avoids OCR errors on complex layouts.
Avoid: Skipping the reranking stage to save latency. The paper shows reranking provides the largest quality jump (WTQ accuracy: 17.19% with retrieval only vs. 19.19% with reranking). The 810ms cost is almost always worth it.
Avoid: Retrieving too few candidates (< 10) in stage 1. The initial retrieval is cheap; cast a wide net and let the reranker do precision work. Recall@10 is the metric that matters here.
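
The advice to strip formatting instructions before encoding could be implemented along these lines. This is a heuristic sketch: the verb and format keyword lists are hypothetical and should be extended to match the phrasing your users actually send.

```python
import re

# Hypothetical pattern: a sentence that opens with an instruction verb and
# names an output format is treated as a formatting instruction.
_FORMAT_HINT = re.compile(
    r"(?:please\s+)?(?:show|give|return|output|format|respond|answer|use)\b"
    r".*\b(?:json|markdown|csv|xml|yaml|bullet points?)\b",
    re.IGNORECASE,
)

def strip_format_instructions(query):
    """Drop sentences that only tell the model how to format its output."""
    sentences = re.split(r"(?<=[.?!])\s+", query.strip())
    kept = [s for s in sentences if not _FORMAT_HINT.match(s)]
    # If every sentence looked like a formatting hint, keep the original
    # query rather than encoding an empty string.
    return " ".join(kept) if kept else query.strip()

cleaned = strip_format_instructions(
    "What was the GDP growth rate in 2024? Show answer in JSON."
)
```

Keep the original query for the generation stage; only the text sent to the retrieval encoder needs to be cleaned.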

Error Handling

Problem	Cause	Solution
Retrieval returns irrelevant tables	Query-image embedding space misalignment	Fine-tune encoders with contrastive loss on domain-specific query-table pairs
Reranker assigns high scores to wrong tables	Visually similar tables with different content	Add hard negatives to reranker training; increase candidate pool size
MLLM generates hallucinated numbers	Table image resolution too low for fine print	Ensure table images are at least 1024px on the long edge; use high-res visual encoders (ViT-L-336px)
FAISS search is slow on large collections	Flat index doesn't scale past ~1M vectors	Switch to IndexIVFFlat or IndexIVFPQ with nprobe tuning
Answer contradicts visible table data	MLLM over-relies on parametric knowledge	Add explicit instruction: "Answer ONLY based on the provided table, not your prior knowledge"
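
For the slow-search row above, one way to decide when to leave the flat index is a small helper that returns a FAISS `index_factory` description string. The ~1M threshold comes from the table; the `4 * sqrt(N)` rule for the number of IVF cells is a commonly cited rule of thumb, used here as an assumption rather than a recommendation from the paper.

```python
import math

def choose_index_spec(num_vectors, flat_limit=1_000_000):
    """Return a FAISS index_factory string suited to the collection size."""
    if num_vectors <= flat_limit:
        return "Flat"  # exact search is fast enough at this scale
    nlist = int(4 * math.sqrt(num_vectors))  # number of IVF cells (assumed heuristic)
    return f"IVF{nlist},Flat"

small_spec = choose_index_spec(5_000)
large_spec = choose_index_spec(4_000_000)
```

The resulting string can be passed to `faiss.index_factory(dim, spec, faiss.METRIC_INNER_PRODUCT)`. An IVF index must be trained with `index.train(embeddings)` before vectors are added, and `index.nprobe` then controls how many cells are visited per query, trading recall for speed.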

Limitations

Requires visual table inputs. If tables are already in structured formats (CSV, database), skip retrieval and use direct querying — this pipeline adds unnecessary complexity for structured data.
Reranking latency. At ~810ms for 10 candidates, the reranking stage may be too slow for real-time applications with sub-200ms requirements. Consider skipping it for latency-critical use cases and accepting lower accuracy.
Training data dependency. The contrastive retrieval model needs paired (query, table image) training data. Cold-start scenarios with no labeled pairs require synthetic data generation or zero-shot approaches.
Single-table reasoning. The framework retrieves and reasons over individual tables. Multi-table join operations (e.g., "Compare revenue across these three companies") require additional orchestration logic not covered by the base TabRAG architecture.
Peak memory. The full pipeline requires ~7.8GB on an A100 GPU. Deployment on edge devices or CPU-only environments requires model distillation or quantization.
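
To make the latency trade-off concrete, the sketch below adds up the per-stage costs cited in this document (57ms retrieval, 810ms reranking, 520ms generation) and checks them against a budget. It assumes the stages run sequentially and that the figures carry over to your hardware, which may not hold.

```python
# Per-stage latencies in milliseconds, as cited in this document.
STAGE_LATENCY_MS = {"retrieval": 57, "reranking": 810, "generation": 520}

def estimate_latency_ms(use_reranker=True):
    """Sum the stage latencies for a sequential run of the pipeline."""
    stages = ["retrieval", "generation"]
    if use_reranker:
        stages.append("reranking")
    return sum(STAGE_LATENCY_MS[stage] for stage in stages)

def should_rerank(budget_ms):
    """Return True if the full three-stage pipeline fits within the budget."""
    return estimate_latency_ms(use_reranker=True) <= budget_ms

full_ms = estimate_latency_ms(use_reranker=True)
fast_ms = estimate_latency_ms(use_reranker=False)
```

Under these figures, dropping the reranker alone still leaves generation as the dominant cost, so very tight budgets also call for a smaller or quantized generator.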

Reference

Paper: Efficient Table Retrieval and Understanding with Multimodal Large Language Models (Xu et al., EACL 2026 Findings)

What to look for: Section 3 for the three-stage architecture details, Table 3 for ablation results showing each stage's contribution, and Section 4.6 for computational cost analysis (57ms retrieval / 810ms reranking / 520ms generation).