neqsim-technical-document-reading - SKILL.md Agent Skill

name: neqsim-technical-document-reading description: "Reads and extracts structured engineering data from technical documents (PDFs, Word, Excel, CSV) and engineering images/drawings (P&IDs, vendor datasheets, mechanical arrangements, performance maps). USE WHEN: a user provides engineering documents or images — equipment data sheets, technical requirements, design basis, well test reports, P&ID descriptions, inspection reports, standards, vendor drawings, compressor maps, phase envelopes, material certificates, trapped-liquid fire rupture evidence packs, or water-hammer route/event evidence — and needs structured data for process simulation. Covers document classification, extraction patterns by document type, image/figure analysis with view_image, unit normalization, data quality scoring, and output formats." last_verified: "2026-07-04"

Technical Document Reading Skill

Extract structured engineering data from technical documents and convert it into formats usable by process simulation, mechanical design, and engineering analysis.

Core Principle

Classify → Extract → Normalize → Validate → Output

Classify the document type to select the right extraction strategy

Extract raw data using format-specific tools (PDF, Word, Excel)

Normalize units, component names, and field names to standard conventions

Validate extracted data against physical bounds and completeness checks

Output structured JSON/dict for downstream consumption

For P&ID-driven operational studies, also load neqsim-pid-process-operations. That skill converts symbols into a process graph, classifies valve functions, maps instrument bubbles to historian tags, and defines steady-state or dynamic NeqSim scenario actions.

For water-hammer or liquid-hammer screening, also load neqsim-water-hammer. Extract route geometry, wall thickness, roughness/piping class, fittings, valve closure timing, design pressure, and tagreader event references so the output can feed WaterHammerStudy or MCP runWaterHammer.

For trapped-liquid fire rupture studies, also load neqsim-trapped-liquid-fire-rupture. Extract segment boundaries, line numbers, pipe geometry, material grade/certificate data, flange/gasket/bolt ratings, fire/PFP basis, relief availability, acceptance criteria, and explicit evidence gaps before handing data to the solver.

1. Document Type Classification

Classification Decision Tree

When a document is provided, classify it first:

Document Type	Identifying Features	Key Data to Extract
Equipment Data Sheet	API/ASME header, tag numbers, design conditions table	Design T/P, materials, dimensions, nozzle schedule
Technical Requirement (TR)	Numbered clauses, "shall"/"should" language, design factors	Design rules, factors, material constraints, limits
Design Basis	Operating envelope, fluid composition tables, flow scheme description	Compositions, T/P ranges, flow rates, design life
Heat & Mass Balance	Stream table with T, P, flow, composition columns	Stream conditions, equipment duties, compositions
Well Test Report	Choke sizes, flow rates vs time, GOR, water cut	Flow rates, compositions, reservoir P/T, productivity
P&ID / PFD Description	Equipment tags, line numbers, instrument tags, connectivity	Topology, instrumentation, control philosophy
Piping Specification	Material classes, pressure ratings, wall schedules	Pipe sizes, materials, ratings, corrosion allowance
Line List / Piping Route Table	Tabular line numbers, NPS, schedule, from/to nodes, straight length, fittings, valves, elevations	Route segments for `PipingRouteBuilder`: segment id, nodes, internal diameter, wall thickness, length, elevation, K values, source refs
Water-Hammer Evidence Pack	STID/P&ID route, valve data sheet, pump trip log, tagreader pressure/flow/valve-position event window	Route segments, design pressure, event schedule, closure/trip timing, field-data overrides for `runWaterHammer`
Trapped Liquid Fire Rupture Evidence Pack	Combination of P&IDs/STIDs, line lists, piping specs, material certificates, flange/gasket data, fire/PFP documents, and relief basis	Segment id, isolation boundary, trapped volume inputs, material grade, flange class, fire exposure, PFP endurance, relief availability, acceptance criteria, evidence gaps
Inspection Report	Thickness measurements, corrosion rates, anomaly locations	Wall thickness, corrosion rate, remaining life
Material Certificate	Heat numbers, SMYS/SMTS, chemical composition, Charpy values	Mechanical properties, chemical analysis, grade
Standards Document	ISO/API/ASME/DNV/NORSOK header, normative references	Design formulas, factors, limits, test requirements
Vendor Quotation	Equipment specs, pricing, delivery, performance curves	Performance data, dimensions, weight, cost
Operating Procedure	Step-by-step instructions, setpoints, alarm limits	Operating envelope, control setpoints, trip values
Engineering Drawing (P&ID)	Piping & instrumentation diagram — lines, valves, instruments, tags visually laid out	Equipment topology, valve tags, instrument tags, line sizes, piping class, connection types
Mechanical Arrangement Drawing	GA/section/plan views with dimensions, elevations, nozzle locations	Physical dimensions, elevations, nozzle orientations, piping run lengths, standpipe geometry
Vendor API Datasheet (image)	API 610/614/617/692 format data sheets rendered as images in PDFs	Seal type, operating conditions, leakage rates, gas supply requirements, material specs
Performance Map / Curve	Compressor maps, pump curves, phase envelopes, operating windows	Head vs flow, efficiency, surge line, operating point, cricondentherm/bar
Process Flow Diagram (image)	Visual PFD with equipment symbols, stream arrows, T/P/flow annotations	Equipment sequence, stream conditions, control valves, key operating parameters

Auto-Detection Heuristics

When document type isn't stated, look for these patterns:

DOCUMENT_SIGNATURES = {
    "equipment_data_sheet": [
        r"(?i)data\s*sheet", r"(?i)tag\s*no", r"(?i)design\s+press",
        r"(?i)operating\s+press", r"(?i)nozzle\s+schedule"
    ],
    "technical_requirement": [
        r"(?i)technical\s+requirement", r"(?i)\bTR[\s-]?\d+",
        r"(?i)\bshall\b.*\bdesign", r"(?i)design\s+factor"
    ],
    "design_basis": [
        r"(?i)design\s+basis", r"(?i)basis\s+of\s+design",
        r"(?i)operating\s+envelope", r"(?i)ambient\s+conditions"
    ],
    "heat_mass_balance": [
        r"(?i)heat\s+.*mass\s+balance", r"(?i)stream\s+(table|summary)",
        r"(?i)mol\s*%|mole\s+frac", r"(?i)vapour\s+fraction"
    ],
    "well_test": [
        r"(?i)well\s+test", r"(?i)choke\s+size", r"(?i)GOR",
        r"(?i)productivity\s+index", r"(?i)IPR"
    ],
    "inspection_report": [
        r"(?i)inspection\s+report", r"(?i)wall\s+thickness",
        r"(?i)corrosion\s+rate", r"(?i)remaining\s+life"
    ],
    "material_certificate": [
        r"(?i)material\s+cert", r"(?i)heat\s+number",
        r"(?i)SMYS|SMTS", r"(?i)charpy|impact\s+test"
    ],
    "piping_spec": [
        r"(?i)piping\s+class", r"(?i)material\s+class",
        r"(?i)line\s+class", r"(?i)pressure\s+rating"
    ],
    "line_list_route_table": [
        r"(?i)line\s*list", r"(?i)route\s*table",
        r"(?i)stress\s*iso(metric)?", r"(?i)from\s+node.*to\s+node",
        r"(?i)nominal\s+(size|diameter)", r"(?i)straight\s+length"
    ],
    "trapped_liquid_fire_rupture": [
        r"(?i)trapped\s+liquid", r"(?i)blocked[-\s]?in\s+liquid",
        r"(?i)thermal\s+expansion", r"(?i)fire\s+exposure",
        r"(?i)no\s+pressure\s+relief", r"(?i)passive\s+fire\s+protection|PFP",
        r"(?i)flange\s+(failure|leak|rupture)", r"(?i)pipe\s+rupture"
    ],
    "engineering_drawing_pid": [
        r"(?i)piping\s+.*instrument.*diagram", r"(?i)P&ID",
        r"(?i)P\s*&\s*I\s*D", r"(?i)instrument\s+diagram"
    ],
    "mechanical_arrangement": [
        r"(?i)general\s+arrangement", r"(?i)GA\s+drawing",
        r"(?i)section\s+view", r"(?i)plan\s+view",
        r"(?i)mechanical\s+arrangement", r"(?i)elevation\s+view"
    ],
    "vendor_api_datasheet_image": [
        r"(?i)API\s+6[19][0247]", r"(?i)dry\s+gas\s+seal",
        r"(?i)seal\s+gas", r"(?i)vendor\s+data\s*sheet",
        r"(?i)mechanical\s+seal", r"(?i)compressor\s+data\s*sheet"
    ],
    "performance_map": [
        r"(?i)performance\s+(map|curve)", r"(?i)compressor\s+map",
        r"(?i)pump\s+curve", r"(?i)phase\s+envelope",
        r"(?i)head\s+vs\s+flow", r"(?i)surge\s+line",
        r"(?i)operating\s+point", r"(?i)cricondentherm"
    ],
}

Trapped-Liquid Fire Rupture Extraction Schema

When the downstream task is trapped liquid, blocked-in liquid, fire rupture, PFP demand, or no pressure relief on liquid-filled piping, build this study input block:

{
    "study_type": "trapped_liquid_fire_rupture",
    "segments": [
        {
            "segment_id": "...",
            "line_numbers": ["..."],
            "isolation_boundary": {
                "upstream": {"tag": "...", "source": "..."},
                "downstream": {"tag": "...", "source": "..."},
                "vents_drains_relief_paths": []
            },
            "pipe_geometry": {
                "internal_diameter_m": {"value": null, "source": "...", "confidence": 0.0},
                "wall_thickness_m": {"value": null, "source": "...", "confidence": 0.0},
                "exposed_length_m": {"value": null, "source": "...", "confidence": 0.0}
            },
            "material": {
                "grade": {"value": null, "source": "..."},
                "smys_MPa": {"value": null, "source": "..."},
                "smts_MPa": {"value": null, "source": "..."}
            },
            "flange_gasket_bolt": {
                "class": {"value": null, "source": "..."},
                "gasket_type": {"value": null, "source": "..."},
                "temperature_rating_source": "..."
            },
            "fluid_and_conditions": {
                "fluid_description": "...",
                "composition": [],
                "operating_pressure_bara": {"value": null, "source": "..."},
                "operating_temperature_C": {"value": null, "source": "..."}
            },
            "fire_and_pfp": {
                "fire_type": "api521_pool_fire | fixed_heat_flux | radiative_fire | unknown",
                "heat_flux_W_m2": {"value": null, "source": "..."},
                "pfp_required_endurance_s": {"value": null, "source": "..."}
            },
            "relief_and_acceptance": {
                "relief_available": {"value": null, "source": "..."},
                "relief_set_pressure_bara": {"value": null, "source": "..."},
                "acceptance_criteria": []
            },
            "evidence_gaps": []
        }
    ]
}

Extraction priority for conflicting values: material certificate > piping spec > line list > P&ID annotation > narrative report > inferred/default. Always report conflicts rather than silently choosing one value.

2. File Format Handlers

2.1 PDF Extraction

Recommended library: pdfplumber (table-aware), fallback to pymupdf (fitz)

import pdfplumber

def extract_pdf(filepath):
    """Extract text and tables from a PDF document."""
    pages = []
    tables = []
    with pdfplumber.open(filepath) as pdf:
        for i, page in enumerate(pdf.pages):
            # Extract text
            text = page.extract_text() or ""
            pages.append({"page": i + 1, "text": text})

            # Extract tables (pdfplumber detects table boundaries)
            for table in page.extract_tables():
                if table and len(table) > 1:
                    headers = [str(h).strip() if h else "" for h in table[0]]
                    rows = []
                    for row in table[1:]:
                        rows.append([str(c).strip() if c else "" for c in row])
                    tables.append({
                        "page": i + 1,
                        "headers": headers,
                        "rows": rows
                    })
    return {"pages": pages, "tables": tables}

Extracting figures and diagrams from PDFs (PREFERRED):

Use devtools/pdf_to_figures.py to convert PDF pages to PNG images for visual analysis by AI tools (view_image). This is the fastest way to inspect engineering drawings, P&IDs, charts, tables, and compressor maps:

# CLI — convert all PDFs in the references folder
python devtools/pdf_to_figures.py step1_scope_and_research/references/ --outdir figures/

# CLI — specific pages only (1-indexed)
python devtools/pdf_to_figures.py path/to/document.pdf --pages 1 3 5 --outdir figures/

# Python / Notebook — programmatic use
from devtools.pdf_to_figures import pdf_to_pngs, pdf_folder_to_pngs

pngs = pdf_to_pngs("references/compressor_sketch.pdf", outdir="figures/")
all_pngs = pdf_folder_to_pngs("step1_scope_and_research/references/", outdir="figures/")

Then use view_image on the extracted PNGs to read diagrams, extract data from charts, identify equipment layouts, and analyze engineering drawings. Requires pymupdf.

Handling scanned PDFs:

If extract_text() returns empty, the PDF is likely scanned/image-based
First try pdf_to_figures.py to render pages as images, then use view_image for AI reading
Use OCR as fallback: pytesseract + pdf2image, or run devtools/pdf_ocr.py (auto-detects scanned PDFs and falls back to OCR — see the neqsim-pdf-ocr skill for P&ID tag extraction and OCRmyPDF usage)
Flag to user: "This appears to be a scanned document. OCR extraction may have errors."

2.2 Word Document Extraction

Library: python-docx

from docx import Document

def extract_docx(filepath):
    """Extract paragraphs and tables from a Word document."""
    doc = Document(filepath)
    paragraphs = []
    tables = []

    for para in doc.paragraphs:
        if para.text.strip():
            paragraphs.append({
                "text": para.text.strip(),
                "style": para.style.name,
                "level": _heading_level(para.style.name)
            })

    for i, table in enumerate(doc.tables):
        headers = [cell.text.strip() for cell in table.rows[0].cells]
        rows = []
        for row in table.rows[1:]:
            rows.append([cell.text.strip() for cell in row.cells])
        tables.append({"index": i, "headers": headers, "rows": rows})

    return {"paragraphs": paragraphs, "tables": tables}


def _heading_level(style_name):
    """Return heading level (1-9) or 0 for body text."""
    if style_name.startswith("Heading"):
        try:
            return int(style_name.split()[-1])
        except ValueError:
            return 0
    return 0

2.3 Excel / CSV Extraction

Libraries: openpyxl for .xlsx, pandas for both

import pandas as pd

def extract_excel(filepath, sheet_name=None):
    """Extract tables from Excel workbook."""
    sheets = {}
    xls = pd.ExcelFile(filepath)
    target_sheets = [sheet_name] if sheet_name else xls.sheet_names

    for name in target_sheets:
        df = pd.read_excel(xls, sheet_name=name)
        # Drop fully empty rows/columns
        df = df.dropna(how="all").dropna(axis=1, how="all")
        sheets[name] = {
            "headers": list(df.columns),
            "rows": df.values.tolist(),
            "shape": list(df.shape)
        }
    return sheets


def extract_csv(filepath):
    """Extract table from CSV file."""
    df = pd.read_csv(filepath)
    df = df.dropna(how="all").dropna(axis=1, how="all")
    return {
        "headers": list(df.columns),
        "rows": df.values.tolist(),
        "shape": list(df.shape)
    }

3. Extraction Patterns by Document Type

3.1 Equipment Data Sheet

Equipment data sheets follow standard API/ISO layouts. Key sections:

DATASHEET_SECTIONS = {
    "general": {
        "patterns": [
            r"(?i)tag\s*(?:no|number|#)[\s:]+(\S+)",
            r"(?i)service[\s:]+(.+?)(?:\n|$)",
            r"(?i)quantity[\s:]+(\d+)",
            r"(?i)type[\s:]+(.+?)(?:\n|$)",
        ],
        "fields": ["tag_number", "service", "quantity", "equipment_type"]
    },
    "design_conditions": {
        "patterns": [
            r"(?i)design\s+press(?:ure)?[\s:]+([0-9.]+)\s*(bar|bara|barg|psi|psig|MPa|kPa)",
            r"(?i)design\s+temp(?:erature)?[\s:]+([0-9.+-]+)\s*(°?[CFK]|degC|degF)",
            r"(?i)oper(?:ating)?\s+press(?:ure)?[\s:]+([0-9.]+)\s*(bar|bara|barg|psi|psig|MPa|kPa)",
            r"(?i)oper(?:ating)?\s+temp(?:erature)?[\s:]+([0-9.+-]+)\s*(°?[CFK]|degC|degF)",
        ],
        "fields": ["design_pressure", "design_temperature",
                    "operating_pressure", "operating_temperature"]
    },
    "dimensions": {
        "patterns": [
            r"(?i)(?:ID|inner\s+diameter|internal\s+diameter)[\s:]+([0-9.]+)\s*(mm|m|in|inch|ft)",
            r"(?i)(?:OD|outer\s+diameter|outside\s+diameter)[\s:]+([0-9.]+)\s*(mm|m|in|inch|ft)",
            r"(?i)length[\s:]+([0-9.]+)\s*(mm|m|ft)",
            r"(?i)(?:t-t|tan.*tan|seam.*seam|height)[\s:]+([0-9.]+)\s*(mm|m|ft)",
            r"(?i)wall\s+thick(?:ness)?[\s:]+([0-9.]+)\s*(mm|in|inch)",
        ],
        "fields": ["inner_diameter", "outer_diameter", "length",
                    "height", "wall_thickness"]
    },
    "material": {
        "patterns": [
            r"(?i)(?:shell|body)\s+material[\s:]+(\S+(?:\s+\S+)?)",
            r"(?i)material\s+grade[\s:]+(\S+)",
            r"(?i)(?:SA|ASTM|A)\s*-?\s*(\d+)(?:\s*-?\s*(?:Gr\.?\s*)?(\S+))?",
        ],
        "fields": ["shell_material", "material_grade"]
    }
}

Table extraction for data sheets:

def extract_datasheet_table(table):
    """Parse an equipment data sheet table into structured fields.

    Handles common layouts where col 0 is field name, col 1+ are values.
    Also handles multi-case tables (normal/upset/design columns).
    """
    result = {}
    headers = table["headers"]
    for row in table["rows"]:
        if len(row) >= 2 and row[0]:
            field_name = row[0].strip().lower()
            # Multi-case: create dict with case names as keys
            if len(headers) > 2 and len(row) > 2:
                for i, header in enumerate(headers[1:], start=1):
                    if i < len(row) and row[i]:
                        case_key = header.strip() if header else f"case_{i}"
                        if field_name not in result:
                            result[field_name] = {}
                        result[field_name][case_key] = _parse_value(row[i])
            else:
                result[field_name] = _parse_value(row[1])
    return result


def _parse_value(text):
    """Parse a text value, attempting numeric conversion."""
    text = str(text).strip()
    if not text or text == "-" or text.lower() == "n/a":
        return None
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text

3.2 Heat & Mass Balance (Stream Table)

Stream tables are the most common source of process data. They vary in layout:

Layout A — Streams as columns:

Property	Stream 1	Stream 2	Stream 3
Temperature (°C)	40.0	15.0	120.0
Pressure (bara)	80.0	78.0	150.0

Layout B — Streams as rows:

Stream	T (°C)	P (bara)	Flow (kg/h)
Feed	40.0	80.0	75000

def extract_stream_table(table, layout="auto"):
    """Extract stream data from a heat & mass balance table.

    Args:
        table: dict with "headers" and "rows" keys
        layout: "columns" (streams as columns), "rows" (streams as rows), or "auto"

    Returns:
        dict mapping stream names to their properties
    """
    headers = table["headers"]
    rows = table["rows"]

    if layout == "auto":
        layout = _detect_stream_layout(headers, rows)

    streams = {}

    if layout == "columns":
        # First column is property name, remaining columns are streams
        stream_names = [h.strip() for h in headers[1:] if h and h.strip()]
        for row in rows:
            if not row[0]:
                continue
            prop_name, prop_unit = _parse_property_header(row[0])
            for i, stream_name in enumerate(stream_names):
                col_idx = i + 1
                if col_idx < len(row) and row[col_idx]:
                    if stream_name not in streams:
                        streams[stream_name] = {}
                    streams[stream_name][prop_name] = {
                        "value": _parse_value(row[col_idx]),
                        "unit": prop_unit
                    }
    else:
        # First column (or first few) is stream name, rest are properties
        for row in rows:
            if not row[0]:
                continue
            stream_name = str(row[0]).strip()
            streams[stream_name] = {}
            for j, header in enumerate(headers[1:], start=1):
                if j < len(row) and row[j]:
                    prop_name, prop_unit = _parse_property_header(header)
                    streams[stream_name][prop_name] = {
                        "value": _parse_value(row[j]),
                        "unit": prop_unit
                    }

    return streams


def _detect_stream_layout(headers, rows):
    """Heuristic: if first header looks like a property label, it's column layout."""
    first_header = str(headers[0]).strip().lower()
    property_keywords = ["property", "parameter", "description", "item",
                         "stream", "temperature", "pressure", "flow"]
    if any(kw in first_header for kw in property_keywords[:4]):
        return "columns"
    return "rows"


def _parse_property_header(text):
    """Split 'Temperature (°C)' into ('temperature', '°C')."""
    import re
    text = str(text).strip()
    match = re.match(r"(.+?)\s*[\(\[]([^\)\]]+)[\)\]]", text)
    if match:
        return match.group(1).strip().lower(), match.group(2).strip()
    return text.lower(), ""

3.3 Fluid Composition Tables

# Common header variations for composition columns
COMPOSITION_HEADERS = {
    "component": [
        r"(?i)component", r"(?i)species", r"(?i)compound", r"(?i)name"
    ],
    "mole_fraction": [
        r"(?i)mol\s*%", r"(?i)mole\s+frac", r"(?i)mol\s+frac",
        r"(?i)y\s*[\(\[]", r"(?i)x\s*[\(\[]", r"(?i)z\s*[\(\[]",
        r"(?i)composition\s*\(mol", r"(?i)molar"
    ],
    "mass_fraction": [
        r"(?i)wt\s*%", r"(?i)mass\s+frac", r"(?i)weight\s+frac",
        r"(?i)mass\s*%"
    ],
    "volume_fraction": [
        r"(?i)vol\s*%", r"(?i)volume\s+frac", r"(?i)liq\s+vol\s*%"
    ]
}

# Component name mapping (common variations → NeqSim names)
COMPONENT_NAME_MAP = {
    # Hydrocarbons
    "c1": "methane", "ch4": "methane", "methane": "methane",
    "c2": "ethane", "c2h6": "ethane", "ethane": "ethane",
    "c3": "propane", "c3h8": "propane", "propane": "propane",
    "ic4": "i-butane", "i-c4": "i-butane", "isobutane": "i-butane",
    "nc4": "n-butane", "n-c4": "n-butane",
    "ic5": "i-pentane", "i-c5": "i-pentane", "isopentane": "i-pentane",
    "nc5": "n-pentane", "n-c5": "n-pentane",
    "nc6": "n-hexane", "n-c6": "n-hexane", "hexane": "n-hexane",
    "nc7": "n-heptane", "n-c7": "n-heptane", "heptane": "n-heptane",
    "nc8": "n-octane", "n-c8": "n-octane", "octane": "n-octane",
    "nc9": "n-nonane", "n-c9": "n-nonane",
    "nc10": "nC10", "n-c10": "nC10",
    "cyclohexane": "cyclohexane", "c-c6": "cyclohexane",
    "benzene": "benzene", "toluene": "toluene",
    # Non-hydrocarbons
    "co2": "CO2", "carbon dioxide": "CO2",
    "h2s": "H2S", "hydrogen sulfide": "H2S", "hydrogen sulphide": "H2S",
    "n2": "nitrogen", "nitrogen": "nitrogen",
    "h2": "hydrogen", "hydrogen": "hydrogen",
    "h2o": "water", "water": "water",
    "o2": "oxygen", "oxygen": "oxygen",
    "ar": "argon", "argon": "argon",
    "he": "helium", "helium": "helium",
    "co": "CO",
    # Glycols
    "meg": "MEG", "monoethylene glycol": "MEG",
    "deg": "DEG", "diethylene glycol": "DEG",
    "teg": "TEG", "triethylene glycol": "TEG",
    # Mercaptans
    "methyl mercaptan": "methyl-mercaptan", "ch3sh": "methyl-mercaptan",
    "ethyl mercaptan": "ethyl-mercaptan", "c2h5sh": "ethyl-mercaptan",
}


def normalize_component_name(raw_name):
    """Map a raw component name to the NeqSim standard name."""
    key = raw_name.strip().lower().replace("_", " ").replace("-", " ")
    # Direct lookup
    if key in COMPONENT_NAME_MAP:
        return COMPONENT_NAME_MAP[key]
    # Try removing spaces
    key_nospace = key.replace(" ", "")
    if key_nospace in COMPONENT_NAME_MAP:
        return COMPONENT_NAME_MAP[key_nospace]
    # Return original (user may need to map manually)
    return raw_name.strip()


def extract_composition(table):
    """Extract fluid composition from a table.

    Returns dict of {neqsim_name: mole_fraction} and metadata about
    the composition basis (mole%, mass%, etc.).
    """
    import re
    headers = table["headers"]

    # Find component column and value column
    comp_col = None
    value_col = None
    basis = "mole_fraction"

    for i, h in enumerate(headers):
        h_str = str(h).strip()
        for pattern in COMPOSITION_HEADERS["component"]:
            if re.search(pattern, h_str):
                comp_col = i
                break
        for basis_type in ["mole_fraction", "mass_fraction", "volume_fraction"]:
            for pattern in COMPOSITION_HEADERS[basis_type]:
                if re.search(pattern, h_str):
                    value_col = i
                    basis = basis_type
                    break

    # Fallback: assume col 0 = component, col 1 = value
    if comp_col is None:
        comp_col = 0
    if value_col is None:
        value_col = 1

    composition = {}
    for row in table["rows"]:
        if comp_col < len(row) and value_col < len(row):
            name = str(row[comp_col]).strip()
            value = _parse_value(row[value_col])
            if name and value is not None:
                neqsim_name = normalize_component_name(name)
                composition[neqsim_name] = float(value)

    # Normalize: if values sum to ~100, convert to fractions
    total = sum(composition.values())
    if total > 1.5:  # Likely percentages
        composition = {k: v / 100.0 for k, v in composition.items()}
        total = sum(composition.values())

    return {
        "components": composition,
        "basis": basis,
        "total": total,
        "normalized": abs(total - 1.0) < 0.02
    }

3.4 Technical Requirements

TR documents contain design rules expressed as requirements. Extract:

def extract_requirements(text):
    """Extract numbered requirements from a technical requirements document.

    Looks for 'shall' and 'should' statements with associated values.
    """
    import re
    requirements = []

    # Pattern: numbered clause + requirement text
    clause_pattern = re.compile(
        r"(\d+(?:\.\d+)*)\s+(.*?(?:shall|should|must|may)\s+.*?)(?=\n\d+(?:\.\d+)*\s|\Z)",
        re.IGNORECASE | re.DOTALL
    )

    for match in clause_pattern.finditer(text):
        clause_num = match.group(1)
        req_text = match.group(2).strip()

        # Extract numeric values with units
        values = re.findall(
            r"(\d+(?:\.\d+)?)\s*(bar[ag]?|°?[CFK]|mm|m|kg|psi|MPa|kPa|%|hr|min)",
            req_text
        )

        # Classify requirement type
        if re.search(r"(?i)shall\b", req_text):
            level = "mandatory"
        elif re.search(r"(?i)should\b", req_text):
            level = "recommended"
        else:
            level = "informative"

        requirements.append({
            "clause": clause_num,
            "text": req_text,
            "level": level,
            "values": [{"value": float(v), "unit": u} for v, u in values]
        })

    return requirements

3.5 Well Test Reports

WELL_TEST_FIELDS = {
    "flow_rate": [
        r"(?i)(?:oil|gas|liquid|water)\s+(?:flow\s+)?rate[\s:]+([0-9.,]+)\s*(Sm3/d|bbl/d|MMSCFD|kg/hr|m3/d)",
    ],
    "gor": [
        r"(?i)GOR[\s:]+([0-9.,]+)\s*(Sm3/Sm3|scf/bbl|m3/m3)",
    ],
    "water_cut": [
        r"(?i)water\s*cut[\s:]+([0-9.,]+)\s*(%)?",
        r"(?i)BSW[\s:]+([0-9.,]+)\s*(%)?",
    ],
    "reservoir_pressure": [
        r"(?i)reservoir\s+press(?:ure)?[\s:]+([0-9.,]+)\s*(bar[ag]?|psi[ag]?|MPa|kPa)",
    ],
    "reservoir_temperature": [
        r"(?i)reservoir\s+temp(?:erature)?[\s:]+([0-9.,]+)\s*(°?[CFK])",
    ],
    "wellhead_pressure": [
        r"(?i)(?:WHP|wellhead\s+press(?:ure)?)[\s:]+([0-9.,]+)\s*(bar[ag]?|psi[ag]?|MPa)",
    ],
    "wellhead_temperature": [
        r"(?i)(?:WHT|wellhead\s+temp(?:erature)?)[\s:]+([0-9.,]+)\s*(°?[CFK])",
    ],
    "choke_size": [
        r"(?i)choke[\s:]+([0-9.,]+)(?:/64)?[\s]*(mm|inch|in|/64)",
    ],
    "productivity_index": [
        r"(?i)(?:PI|productivity\s+index)[\s:]+([0-9.,]+)\s*(Sm3/d/bar|bbl/d/psi)",
    ],
}

3.6 Inspection Reports

def extract_thickness_data(table):
    """Extract wall thickness measurements from an inspection table.

    Common format: Location | Nominal (mm) | Measured (mm) | Min (mm) | Corrosion Rate
    """
    result = []
    for row in table["rows"]:
        entry = {}
        for i, header in enumerate(table["headers"]):
            h = str(header).strip().lower()
            if i < len(row) and row[i]:
                val = _parse_value(row[i])
                if "location" in h or "position" in h or "point" in h:
                    entry["location"] = str(row[i]).strip()
                elif "nominal" in h or "original" in h:
                    entry["nominal_mm"] = val
                elif "measured" in h or "actual" in h or "remaining" in h:
                    entry["measured_mm"] = val
                elif "min" in h:
                    entry["minimum_mm"] = val
                elif "corrosion" in h and "rate" in h:
                    entry["corrosion_rate_mm_yr"] = val
        if entry:
            result.append(entry)
    return result

3.7 Image and Figure Analysis (Engineering Drawings, P&IDs, Performance Maps)

Many engineering documents contain critical data embedded in images rather than text — P&IDs, mechanical arrangement drawings, vendor datasheets rendered as scanned PDFs, compressor performance maps, and phase envelopes. Use the view_image tool for multimodal analysis of these visual documents.

3.7.1 Workflow: Image-Based Document Analysis

PDF/Image Document → pdf_to_figures.py → PNG pages → view_image → Structured Data

Step 1: Convert PDF pages to images

# Single file
python devtools/pdf_to_figures.py path/to/document.pdf --outdir figures/ --dpi 200

# Specific pages
python devtools/pdf_to_figures.py document.pdf --outdir figures/ --pages 1,3,5

# Batch (all PDFs in a folder)
python devtools/pdf_to_figures.py references/ --outdir figures/

Step 2: View and analyze each page with view_image

Use view_image on each extracted PNG. For each image, systematically extract:

Title block — document number, revision, title, date, originator
Equipment tags — vessel tags (V-xxx), pump tags (P-xxx), compressor tags (BCL-xxx)
Instrument tags — transmitters (PT-, TT-, LT-, FT-), valves (XV-, PV-, LV-)
Piping information — line sizes, piping class, connection types (flanged, welded)
Operating conditions — T, P, flow annotations on streams
Notes and legends — design notes, material callouts, hold/revision clouds
Dimensional data — lengths, diameters, elevations (for GA drawings)

Step 3: Structure the extracted data

Convert visual observations into the standard extraction result JSON format (Section 6.1), adding an image_analysis block.

3.7.2 Engineering Drawing Types and Extraction Patterns

Drawing Type	What to Look For	Key Data to Extract
P&ID (Piping & Instrumentation)	Equipment symbols, instrument bubbles, line numbers, valve symbols, piping class annotations	Equipment tags + types, valve tags + types (gate/globe/ball/check), instrument tags + functions, line numbers + sizes + ratings, control loops, interlock references
Line List / Stress Isometric	Rows with line numbers, from/to nodes, NPS/schedule, lengths, elevations, fittings, valves	Ordered route segments for `PipingRouteBuilder`, including internal diameter assumptions, K values, and source refs
Mechanical Arrangement (GA)	Plan/section/elevation views, dimension lines, nozzle positions, structural supports	Overall dimensions (L×W×H), nozzle sizes + orientations, standpipe lengths + bore sizes, access platform locations, foundation bolt patterns
Vendor API Datasheet (image)	API format tables (610/614/617/692), operating conditions, seal/bearing details	Design T/P, seal type, shaft speed, gas supply requirements, leakage rates, material specs, utility requirements
Compressor/Pump Performance Map	Head vs flow curves, efficiency lines, surge line, stonewall, rated point marker	Rated head/flow/power, surge point, maximum continuous speed, efficiency at rated/off-design, operating window boundaries
Phase Envelope	Bubble/dew point curves, cricondentherm, cricondenbar, operating path markers, quality lines	Cricondentherm (°C), cricondenbar (bara), critical point (T,P), operating point position relative to envelope, two-phase region extent
Process Flow Diagram (PFD)	Equipment blocks, stream arrows, T/P/flow labels, utility connections	Equipment sequence, stream T/P/flow, utility duties, recycle loops, control valve locations
Seal Gas / Utility Piping Schematic	Small-bore piping, filters, orifices, check valves, vent/drain connections	Flow path topology, pipe sizes (often 3/8"–1"), filter positions, orifice sizes, vent/drain locations, instrumentation (dP transmitters, pressure gauges)

3.7.3 P&ID Data Extraction Pattern

When reading a P&ID image with view_image, extract data in this structured format:

PID_EXTRACTION = {
    "document": {
        "drawing_number": "SOK-xxxxxx",
        "revision": "C2",
        "title": "Oil/Seal Gas Piping — Compressor BCL-304/D",
        "area": "26",
        "unit": "GIC Compressor"
    },
    "equipment": [
        {"tag": "BCL-304/D", "type": "Centrifugal Compressor", "service": "Gas Injection"},
        {"tag": "FLT-26110", "type": "Coalescing Filter", "service": "Seal Gas Supply"},
    ],
    "valves": [
        {"tag": "VB26-0130", "type": "Ball Valve", "size_inch": 0.75,
         "class": "2500 RTJ", "service": "Seal Gas Supply Isolation"},
        {"tag": "XV-S26102.05", "type": "Shutdown Valve",
         "service": "Primary Seal Gas", "normally": "open"},
    ],
    "instruments": [
        {"tag": "PT-S26102.03", "type": "Pressure Transmitter",
         "service": "Seal Gas Supply Pressure", "range": "0-200 barg"},
        {"tag": "TP-S26102.11", "type": "Temperature Point",
         "service": "Primary Vent Temperature"},
        {"tag": "FT-S26102.01", "type": "Flow Transmitter",
         "service": "Seal Gas Flow"},
    ],
    "piping": [
        {"line_number": "26-G1H-001", "size_inch": 0.75,
         "piping_class": "G1H", "rating": "2500#",
         "from": "FLT-26110", "to": "BCL-304/D NDE Seal"},
    ],
    "connections": [
        {"from": "FLT-26110.outlet", "to": "BCL-304/D.seal_gas_inlet",
         "type": "seal_gas_supply"},
        {"from": "BCL-304/D.primary_vent", "to": "VB26-0180",
         "type": "seal_vent", "destination": "Flare/Vent"},
    ],
    "route_segments": [
        {"segment_id": "26-G1H-001-S1", "from_node": "FLT-26110", "to_node": "XV-S26102.05",
         "line_number": "26-G1H-001", "size_inch": 0.75,
         "internal_diameter_m": 0.019, "length_m": 4.5,
         "elevation_change_m": 0.0,
         "minor_losses": [{"type": "ball valve", "k_value": 0.05}],
         "source_ref": "P&ID page 1, line 26-G1H-001"}
    ]
}

For route pressure-drop work, pass route_segments to neqsim.process.equipment.pipeline.routing.PipingRouteBuilder and save route.toJson() with the task results.

For operational studies, also create a pid_operational_model block as defined in neqsim-pid-process-operations. Include symbol semantics, valve normal and fail positions when stated, control links, drains, vents, bypasses, check valves, and logical tag names for plant data binding. Flag uncertain symbol readings; do not infer live valve state from drawing normal position alone.

3.7.4 Vendor API Datasheet Image Extraction

For vendor datasheets rendered as images (common for API 692 seal datasheets, API 617 compressor datasheets):

VENDOR_DATASHEET_EXTRACTION = {
    "equipment_tag": "BCL-304/D",
    "vendor": "John Crane / Flowserve / EagleBurgmann",
    "api_standard": "API 692",
    "data_sections": {
        "operating_conditions": {
            "suction_pressure_barg": 125.0,
            "discharge_pressure_barg": 190.0,
            "seal_gas_supply_pressure_barg": 155.0,
            "operating_temperature_C": 45.0,
            "shaft_speed_rpm": 11000,
        },
        "seal_data": {
            "seal_type": "Tandem dry gas seal",
            "seal_arrangement": "API Plan 74 (primary) + API Plan 76 (secondary)",
            "primary_leakage_nm3hr": 12.0,
            "secondary_leakage_nm3hr": 2.0,
            "buffer_gas": "Nitrogen",
        },
        "gas_supply": {
            "required_flow_nm3hr": 25.0,
            "required_pressure_above_ref_barg": 3.5,
            "filtration_micron": 3,
            "gas_temperature_range_C": [-10, 60],
        },
        "materials": {
            "face_material": "Silicon Carbide",
            "seat_material": "Carbon",
            "o_ring_material": "Viton / FFKM",
            "spring_material": "Inconel 718",
        }
    }
}

3.7.5 Performance Map / Phase Envelope Extraction

When analyzing compressor maps or phase envelopes from images:

PERFORMANCE_MAP_EXTRACTION = {
    "chart_type": "compressor_map",  # or "phase_envelope", "pump_curve"
    "axes": {
        "x": {"label": "Inlet Volume Flow", "unit": "m3/hr"},
        "y": {"label": "Polytropic Head", "unit": "kJ/kg"}
    },
    "rated_point": {
        "flow": 5000, "head": 85.0, "efficiency_pct": 82.0, "speed_rpm": 11000
    },
    "surge_point": {
        "flow": 3200, "head": 92.0
    },
    "curves": [
        {"speed_rpm": 11000, "points": [
            {"flow": 3200, "head": 92.0},
            {"flow": 4000, "head": 89.0},
            {"flow": 5000, "head": 85.0},
            {"flow": 6000, "head": 78.0},
        ]},
    ],
    "operating_window": {
        "min_flow": 3500, "max_flow": 5800,
        "min_head": 70.0, "max_head": 95.0
    }
}

PHASE_ENVELOPE_EXTRACTION = {
    "chart_type": "phase_envelope",
    "axes": {
        "x": {"label": "Temperature", "unit": "C"},
        "y": {"label": "Pressure", "unit": "bar"}
    },
    "critical_point": {"temperature_C": -82.0, "pressure_bar": 46.0},
    "cricondentherm": {"temperature_C": -5.4, "pressure_bar": 38.0},
    "cricondenbar": {"temperature_C": -25.0, "pressure_bar": 55.0},
    "key_points": [
        {"label": "Operating Point", "temperature_C": 25.0, "pressure_bar": 45.0,
         "phase_region": "single_phase_gas"},
        {"label": "JT Outlet", "temperature_C": -20.0, "pressure_bar": 3.0,
         "phase_region": "single_phase_gas"},
    ],
    "retrograde_region": {
        "temperature_range_C": [-40, -5],
        "pressure_range_bar": [30, 55],
        "max_liquid_fraction_pct": 8.0
    }
}

3.7.6 Mechanical Arrangement Drawing Extraction

For GA drawings showing physical layout and dimensions:

MECHANICAL_ARRANGEMENT_EXTRACTION = {
    "drawing_type": "mechanical_arrangement",
    "equipment_tag": "BCL-304",
    "views": ["plan", "section_A-A", "elevation_north"],
    "overall_dimensions": {
        "length_mm": 4500,
        "width_mm": 2800,
        "height_mm": 3200
    },
    "nozzles": [
        {"tag": "N1", "service": "Suction", "size_inch": 16,
         "orientation": "horizontal", "elevation_mm": 1500},
        {"tag": "N2", "service": "Discharge", "size_inch": 12,
         "orientation": "horizontal", "elevation_mm": 1500},
        {"tag": "N5", "service": "Seal Gas Supply", "size_inch": 0.75,
         "orientation": "radial", "elevation_mm": 1800},
    ],
    "standpipes_and_drains": [
        {"service": "Primary Vent Drain", "od_inch": 1.5, "id_mm": 38,
         "length_m": 1.5, "volume_liters": 1.70,
         "orientation": "vertical_down", "low_point_elevation_mm": 200},
    ],
    "piping_runs": [
        {"from": "BCL-304.N5", "to": "FLT-26110", "pipe_size_inch": 0.75,
         "material": "SS316", "approx_length_m": 8.0}
    ]
}

3.7.7 Figure Discussion Generation Pattern

After extracting data from engineering figures during a task analysis, generate discussion blocks for the report. Each figure discussion follows this template:

FIGURE_DISCUSSION_TEMPLATE = {
    "figure": "filename.png",
    "title": "Descriptive Title of the Figure",
    "observation": "What the figure shows, with specific numbers and features identified.",
    "mechanism": "The underlying physical or engineering reason for what is observed.",
    "implication": "What this means for the design, operation, or safety assessment.",
    "recommendation": "Specific actionable engineering recommendation based on this figure.",
    "linked_results": ["key_result_1", "key_result_2"],
    "insight_question_ref": "Q3"
}

When to generate figure discussions:

Figure Source	When Discussion is Needed
Simulation output plot (T/P profiles, phase envelopes)	Always for decision-critical figures; proportional for others
Vendor datasheet image	When extracted data influences design decisions
P&ID image	When topology reveals flow paths critical to the analysis
Mechanical drawing	When dimensions affect calculations (e.g., standpipe volumes)
Performance map	When operating point proximity to limits is relevant
Benchmark comparison plot	Always — explains deviations

3.7.8 Best Practices for Image Analysis

Always use pdf_to_figures.py first — render PDF pages to PNG at 200+ DPI before attempting to read visual content. Direct text extraction misses drawings.
Systematic scanning — when viewing a P&ID or drawing with view_image, scan systematically: title block → equipment → instruments → piping → notes. Don't try to extract everything in one pass.
Cross-reference text and images — use text-extracted data (tables, narrative) to validate what you see in images. If a table says "3/4 inch" but the drawing annotation reads "1 inch", flag the conflict.
Dimensional extraction from drawings — when reading dimensions, note:
- Whether dimensions are NTS (Not To Scale) — most engineering drawings are
- Units (mm vs inches vs feet) — check the drawing notes/title block
- Reference datums — dimensions are relative; identify the reference points
Performance map reading — when digitizing curves from performance maps:
- Read axis scales carefully (linear vs log, units)
- Identify the rated/design point (usually marked with a symbol)
- Read at least 4-5 points per curve for reasonable interpolation
- Note surge/stonewall lines as system limits
Scanned/low-resolution images — if image quality is poor:
- Report which values are uncertain due to readability
- Assign lower confidence scores to those extractions
- Suggest re-scanning at higher resolution if values are critical
Multi-page drawings — large P&IDs often span multiple sheets (1/3, 2/3, 3/3). Track which equipment appears on which sheet and merge topology across sheets.

4. Unit Normalization

All extracted values must be normalized to consistent engineering units.

Standard Units (SI-based Engineering)

Quantity	Standard Unit	Symbol
Temperature	Kelvin	K
Pressure	bara	bara
Mass flow	kg/hr	kg/hr
Volumetric flow (gas)	Sm³/d	Sm3/d
Volumetric flow (liquid)	m³/hr	m3/hr
Length	m	m
Diameter	mm	mm
Wall thickness	mm	mm
Density	kg/m³	kg/m3
Viscosity	mPa·s (cP)	cP
Power	kW	kW
Heat duty	kW	kW
Heat transfer coeff.	W/(m²·K)	W/m2K
Thermal conductivity	W/(m·K)	W/mK

Conversion Functions

def convert_temperature(value, from_unit, to_unit="K"):
    """Convert temperature between C, F, K, R."""
    unit = from_unit.strip().replace("°", "").replace("deg", "").upper()
    # Convert to Kelvin first
    if unit in ("C", "CELSIUS"):
        kelvin = value + 273.15
    elif unit in ("F", "FAHRENHEIT"):
        kelvin = (value - 32) * 5.0 / 9.0 + 273.15
    elif unit in ("R", "RANKINE"):
        kelvin = value * 5.0 / 9.0
    else:
        kelvin = value  # Assume Kelvin

    if to_unit.upper() in ("C", "CELSIUS"):
        return kelvin - 273.15
    elif to_unit.upper() in ("F", "FAHRENHEIT"):
        return (kelvin - 273.15) * 9.0 / 5.0 + 32
    return kelvin


def convert_pressure(value, from_unit, to_unit="bara"):
    """Convert pressure to bara."""
    unit = from_unit.strip().lower()
    conversions_to_bara = {
        "bara": 1.0,
        "barg": lambda v: v + 1.01325,
        "bar": 1.0,  # Assume absolute unless 'g' suffix
        "psia": 0.0689476,
        "psig": lambda v: (v + 14.696) * 0.0689476,
        "psi": 0.0689476,  # Assume absolute
        "mpa": 10.0,
        "kpa": 0.01,
        "atm": 1.01325,
        "mmhg": 0.00133322,
        "torr": 0.00133322,
    }
    factor = conversions_to_bara.get(unit, 1.0)
    if callable(factor):
        return factor(value)
    return value * factor


def convert_length(value, from_unit, to_unit="mm"):
    """Convert length to mm."""
    unit = from_unit.strip().lower()
    to_mm = {
        "mm": 1.0, "cm": 10.0, "m": 1000.0, "km": 1e6,
        "in": 25.4, "inch": 25.4, "inches": 25.4,
        "ft": 304.8, "feet": 304.8, "foot": 304.8,
        "yd": 914.4,
    }
    mm_val = value * to_mm.get(unit, 1.0)
    to_target = {"mm": 1.0, "m": 0.001, "in": 1.0 / 25.4, "ft": 1.0 / 304.8}
    return mm_val * to_target.get(to_unit.lower(), 1.0)


def convert_flow(value, from_unit, to_unit="kg/hr"):
    """Convert mass/volumetric flow rates. Approximate — exact conversion needs density."""
    unit = from_unit.strip().lower().replace(" ", "")
    # Direct mass flow conversions to kg/hr
    mass_to_kghr = {
        "kg/hr": 1.0, "kg/h": 1.0, "kg/s": 3600.0,
        "t/hr": 1000.0, "t/h": 1000.0, "t/d": 1000.0 / 24.0,
        "lb/hr": 0.453592, "lb/h": 0.453592,
    }
    if unit in mass_to_kghr:
        return value * mass_to_kghr[unit]
    return value  # Return as-is with warning

5. Data Quality and Validation

5.1 Completeness Score

Rate extracted data on completeness for downstream simulation:

def score_extraction_quality(extracted_data, document_type):
    """Score the quality and completeness of extracted data (0-100)."""
    scores = {}

    if document_type == "design_basis":
        required = ["fluid_composition", "temperature", "pressure", "flow_rate"]
        optional = ["water_cut", "gor", "h2s_content", "co2_content"]
    elif document_type == "equipment_data_sheet":
        required = ["tag_number", "design_pressure", "design_temperature", "material"]
        optional = ["dimensions", "nozzle_schedule", "weight"]
    elif document_type == "heat_mass_balance":
        required = ["stream_names", "temperatures", "pressures", "flows"]
        optional = ["compositions", "densities", "viscosities"]
    else:
        required = []
        optional = []

    found_required = sum(1 for r in required if r in extracted_data and extracted_data[r])
    found_optional = sum(1 for o in optional if o in extracted_data and extracted_data[o])

    req_score = (found_required / max(len(required), 1)) * 70
    opt_score = (found_optional / max(len(optional), 1)) * 30

    return {
        "total_score": round(req_score + opt_score),
        "required_found": found_required,
        "required_total": len(required),
        "optional_found": found_optional,
        "optional_total": len(optional),
        "missing_required": [r for r in required if r not in extracted_data or not extracted_data[r]],
        "missing_optional": [o for o in optional if o not in extracted_data or not extracted_data[o]],
    }

5.2 Physical Bounds Validation

PHYSICAL_BOUNDS = {
    "temperature_K": (50.0, 2000.0),       # 50 K to 2000 K
    "pressure_bara": (0.001, 10000.0),      # Near vacuum to ultra-high
    "mole_fraction": (0.0, 1.0),
    "flow_kg_hr": (0.0, 1e9),
    "wall_thickness_mm": (0.1, 500.0),
    "diameter_mm": (1.0, 100000.0),         # 1 mm to 100 m
    "density_kg_m3": (0.01, 25000.0),
    "viscosity_cP": (0.001, 1e6),
    "corrosion_rate_mm_yr": (0.0, 50.0),
}

def validate_physical_bounds(field_name, value, unit=None):
    """Check if a value is within physically reasonable bounds."""
    key = field_name.lower().replace(" ", "_")
    for bound_key, (low, high) in PHYSICAL_BOUNDS.items():
        if bound_key in key:
            if value < low or value > high:
                return {
                    "valid": False,
                    "field": field_name,
                    "value": value,
                    "bounds": (low, high),
                    "message": f"{field_name}={value} outside bounds [{low}, {high}]"
                }
            return {"valid": True, "field": field_name, "value": value}
    return {"valid": True, "field": field_name, "value": value, "note": "no bounds defined"}

6. Output Format

6.1 Structured Extraction Result

Every extraction produces this standard JSON structure:

{
  "source": {
    "filename": "design_basis_rev3.pdf",
    "format": "pdf",
    "pages": 42,
    "document_type": "design_basis",
    "confidence": 0.85
  },
  "metadata": {
    "document_title": "Design Basis for Gas Processing Facility",
    "revision": "Rev 3",
    "date": "2024-06-15",
    "document_number": "DOC-12345"
  },
  "fluid_data": {
    "compositions": {
      "feed_gas": {
        "components": {"methane": 0.85, "ethane": 0.07, "propane": 0.03, "CO2": 0.02},
        "basis": "mole_fraction",
        "total": 1.0
      }
    },
    "conditions": {
      "temperature": {"value": 313.15, "unit": "K", "original": "40 °C"},
      "pressure": {"value": 80.0, "unit": "bara", "original": "80 bara"}
    }
  },
  "equipment_data": [
    {
      "tag": "V-100",
      "type": "Separator",
      "service": "HP Separator",
      "design_pressure": {"value": 95.0, "unit": "bara"},
      "design_temperature": {"value": 373.15, "unit": "K"},
      "material": "SA-516-70"
    }
  ],
  "process_data": {
    "streams": {},
    "operating_envelope": {}
  },
  "requirements": [],
  "quality": {
    "total_score": 78,
    "missing_required": ["water_cut"],
    "warnings": ["Composition sums to 0.97 — 3% unaccounted"]
  }
}

6.2 Conversion to NeqSim JSON

The extraction result can be converted to NeqSim process JSON:

def extraction_to_neqsim_json(extraction_result):
    """Convert extraction result to NeqSim ProcessSystem.fromJson() format.

    This bridges the document reader output to the process extraction pipeline.
    """
    fluid_data = extraction_result.get("fluid_data", {})
    compositions = fluid_data.get("compositions", {})
    conditions = fluid_data.get("conditions", {})

    # Pick the first/primary composition
    comp_name = list(compositions.keys())[0] if compositions else None
    if not comp_name:
        return None

    comp = compositions[comp_name]

    result = {
        "fluid": {
            "model": "SRK",
            "temperature": conditions.get("temperature", {}).get("value", 298.15),
            "pressure": conditions.get("pressure", {}).get("value", 1.01325),
            "mixingRule": "classic",
            "components": comp["components"]
        },
        "process": [],
        "autoRun": True
    }

    return result

7. Multi-Document Workflows

When multiple documents describe the same system, merge data with priority rules:

Priority Order (highest first)

Equipment Data Sheets — most specific, verified engineering data
Heat & Mass Balance — simulation-verified stream data
P&ID — definitive topology and instrumentation
Design Basis — project-level specifications
Technical Requirements — design rules and constraints
Vendor Quotations — actual equipment performance
Operating Procedures — actual operational parameters

Merge Strategy

def merge_extractions(extractions):
    """Merge multiple extraction results, respecting priority.

    Args:
        extractions: list of (extraction_result, priority) tuples,
                     sorted by priority (highest first)
    Returns:
        Merged extraction result
    """
    merged = {
        "sources": [],
        "fluid_data": {"compositions": {}, "conditions": {}},
        "equipment_data": [],
        "process_data": {"streams": {}},
        "requirements": [],
        "conflicts": []
    }

    seen_tags = {}  # tag -> source for conflict detection

    for extraction, priority in sorted(extractions, key=lambda x: -x[1]):
        source = extraction.get("source", {})
        merged["sources"].append(source)

        # Merge compositions (higher priority wins)
        for name, comp in extraction.get("fluid_data", {}).get("compositions", {}).items():
            if name not in merged["fluid_data"]["compositions"]:
                merged["fluid_data"]["compositions"][name] = comp

        # Merge equipment (detect conflicts on same tag)
        for equip in extraction.get("equipment_data", []):
            tag = equip.get("tag", "")
            if tag in seen_tags:
                merged["conflicts"].append({
                    "tag": tag,
                    "field": "equipment",
                    "source_1": seen_tags[tag],
                    "source_2": source.get("filename", "unknown")
                })
            else:
                seen_tags[tag] = source.get("filename", "unknown")
                merged["equipment_data"].append(equip)

        # Merge requirements (accumulate all)
        merged["requirements"].extend(extraction.get("requirements", []))

    return merged

8. Practical Tips

Common Parsing Challenges

Challenge	Solution
Merged cells in PDF tables	Use `pdfplumber` with custom table settings: `table_settings={"snap_tolerance": 5}`
Multi-line cell values	Join with space, then re-parse
Header row detection	Look for bold formatting or known keywords
Unicode issues (°, ², ³, µ)	Normalize with `unicodedata.normalize("NFKD", text)`
Tables spanning multiple pages	Detect continued tables by matching column count
Rotated text in PDFs	Use `page.extract_text(layout=True)` or OCR
Mixed number formats (1,000.5 vs 1.000,5)	Detect locale from document language
Empty/missing values	Use sentinel: `None` (not 0, not empty string)

Performance Guidelines

For large documents (>50 pages), extract table of contents first and target relevant sections
Cache extracted data to avoid re-parsing
For Excel files with many sheets, let user specify which sheets to read
For multi-file batches, process in parallel where possible

Required Python Packages

pdfplumber>=0.9.0     # PDF table extraction
python-docx>=0.8.11   # Word document reading
openpyxl>=3.1.0       # Excel reading
pandas>=1.5.0         # Tabular data handling

Optional (for advanced scenarios):

pymupdf>=1.22.0       # PREFERRED for PDF figure extraction (devtools/pdf_to_figures.py uses this)
pytesseract>=0.3.10   # OCR for scanned PDFs
pdf2image>=1.16.0     # Convert PDF pages to images for OCR
tabula-py>=2.7.0      # Alternative PDF table extraction (Java-based)
camelot-py>=0.11.0    # Another PDF table tool (good for bordered tables)