process-data-explorer

name: process-data-explorer
description: Analyze industrial process data (PV/OP time series, events, alarms) for the Control Actions project. Use when exploring parquet/CSV data, understanding tag relationships, profiling data quality, identifying PV-OP pairs, or mapping event sources to time series columns.
metadata:
  author: control-actions-team
  version: "2.0"
  target-alarm: "03LIC_1071"
  last-run: "2026-01-18"

Process Data Explorer

Version 2.0 Updates

Trip Period Filtering: Automatically excludes data during plant trips
Date Range Filtering: Focus analysis on specific time periods
Shared Preprocessing: Uses the centralized shared/data_loader.py module for consistent data handling
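
As a rough illustration of what trip-period filtering involves, the sketch below drops rows whose timestamps fall inside any trip window. The helper name exclude_trip_periods and the (start, end) tuple format are assumptions made for this example; they are not the actual interface of shared/data_loader.py or the real layout of the trip file.

```python
import numpy as np
import pandas as pd

def exclude_trip_periods(df, trips):
    """Return df without rows whose index falls inside any (start, end) window.

    Hypothetical helper: the (start, end) tuple format is an assumption,
    not the real trip-file layout.
    """
    keep = np.ones(len(df), dtype=bool)
    for start, end in trips:
        in_trip = (df.index >= pd.Timestamp(start)) & (df.index <= pd.Timestamp(end))
        keep &= ~in_trip
    return df[keep]

# Toy data: ten one-minute readings for the target tag
idx = pd.date_range("2025-01-01 00:00", periods=10, freq="min")
demo_df = pd.DataFrame({"03LIC_1071.PV": range(10)}, index=idx)

# One made-up trip window covering minutes 3 through 5 (inclusive)
trips = [("2025-01-01 00:03", "2025-01-01 00:05")]
filtered = exclude_trip_periods(demo_df, trips)
print(len(demo_df), "->", len(filtered))
```

Both window boundaries are treated as inclusive here; whether the real preprocessing does the same is not stated in this document.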

When to Use This Skill

Use this skill when you need to:

Profile and understand the structure of PV/OP time series data
Explore events/actions data and understand its relationship to time series
Identify which tags have both .PV and .OP columns
Map tag names between different data sources (knowledge graph, events, time series)
Generate data quality reports

Quick Start

# Profile time series for the first half of 2025 (with trip filtering)
python .skills/process-data-explorer/scripts/profile_timeseries.py \
    --start-date 2025-01-01 --end-date 2025-06-30

# Map data relationships (filtered)
python .skills/process-data-explorer/scripts/map_data_relationships.py \
    --start-date 2025-01-01 --end-date 2025-06-30

CLI Options

All scripts support these common options:

--start-date YYYY-MM-DD: Filter data from this date
--end-date YYYY-MM-DD: Filter data until this date
--trip-file PATH: Path to trip duration file
--no-trip-filter: Disable trip period filtering
--recent: Analyze only the most recent 6 months of data
--last-year: Analyze only the last year of data
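
For readers who want to add the same options to a new script, here is a minimal sketch of how they could be declared with argparse. It is not the scripts' actual code: the defaults, the use of date.fromisoformat for parsing, and the parser description are all assumptions.

```python
import argparse
from datetime import date

def build_parser():
    """Hypothetical parser mirroring the common options listed above."""
    parser = argparse.ArgumentParser(description="Common options (illustrative only)")
    parser.add_argument("--start-date", type=date.fromisoformat, default=None)
    parser.add_argument("--end-date", type=date.fromisoformat, default=None)
    parser.add_argument("--trip-file", default=None)
    parser.add_argument("--no-trip-filter", action="store_true")
    parser.add_argument("--recent", action="store_true")
    parser.add_argument("--last-year", action="store_true")
    return parser

# Parse the same arguments used in the Quick Start commands
args = build_parser().parse_args(
    ["--start-date", "2025-01-01", "--end-date", "2025-06-30"]
)
print(args.start_date, args.end_date, args.no_trip_filter)
```

How --recent and --last-year interact with explicit dates is not covered here, since this document does not specify it.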

Latest Run Results (Jan 2026)

The scripts have been run and outputs saved to:

RESULTS/timeseries_profile.json - Full time series profile
RESULTS/data_relationships.json - Tag mapping and relationships

Key findings:

15 controllable tags (have both PV and OP)
13 PV-only tags (monitoring only)
1,737,586 rows of minute-wise data (2022-01-03 to 2025-06-23)
41 time gaps >5 minutes (largest: 446 hours)
1,861 PVLO alarm episodes for target tag

Data Sources Overview

1. PV/OP Time Series (`DATA/03LIC_1071_JAN_2026.parquet`)

Columns: End with .PV (Process Variable) or .OP (Output)
Special Columns: AlarmStatus (ON/OFF) and AlarmType (PVLO/blank); both are object (string) dtype, not numeric
Index: TimeStamp (minute-wise readings)
Range: 2022-01-03 to 2025-06-23 (~3.5 years)
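
Because AlarmStatus is object-typed, alarm episodes have to be derived by string comparison rather than arithmetic. The sketch below counts an episode as each contiguous run of ON values. That definition is an assumption and may differ from how the episode count under Key findings was produced; count_alarm_episodes is a hypothetical helper name.

```python
import pandas as pd

def count_alarm_episodes(status):
    """Count contiguous runs of 'ON' in an object-typed status column."""
    is_on = status.astype(str).str.strip().str.upper().eq("ON")
    # An episode starts where the status is ON and the previous row was not ON
    starts = is_on & ~is_on.shift(1, fill_value=False)
    return int(starts.sum())

# Toy column: blanks/None are treated as "not ON"
status = pd.Series(["OFF", "ON", "ON", "OFF", "ON", None, "ON"])
n_episodes = count_alarm_episodes(status)
print(n_episodes)
```

On real data this would be called as count_alarm_episodes(op_pv_data_df['AlarmStatus']), ideally after gaps and trip periods have been handled so that a gap does not merge or split episodes.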

2. Events Data (`DATA/df_df_events_1071_export.csv`)

Key Columns: Source, VT_Start, ConditionName, Action, Value, PrevValue
Event Types: CHANGE (operator actions), PVLO/PVHI (alarms)

3. Related Tags (`DATA/03LIC1071_PropaneLoop_0426.csv`)

Column: tagName - tags from knowledge graph related to target alarm

Step-by-Step Instructions

Step 1: Load and Profile Time Series Data

import pandas as pd

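# Load PV/OP data
op_pv_data_df = pd.read_parquet('DATA/03LIC_1071_JAN_2026.parquet')
# TimeStamp may already be the index; only promote it if it is still a column
if 'TimeStamp' in op_pv_data_df.columns:
    op_pv_data_df = op_pv_data_df.set_index('TimeStamp')
op_pv_data_df.index = pd.to_datetime(op_pv_data_df.index)
op_pv_data_df = op_pv_data_df.sort_index()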

# Identify PV and OP columns; slice off only the trailing suffix so that
# a '.OP' or '.PV' appearing elsewhere in a tag name is left untouched
cols = op_pv_data_df.columns
op_tags = {col[:-3] for col in cols if col.endswith('.OP')}
pv_tags = {col[:-3] for col in cols if col.endswith('.PV')}

# Find tags with BOTH PV and OP (controllable tags)
controllable_tags = op_tags & pv_tags
print(f"Tags with both PV and OP: {len(controllable_tags)}")
print(f"Tags with only OP: {op_tags - pv_tags}")
print(f"Tags with only PV: {pv_tags - op_tags}")

Step 2: Profile Data Quality

Run the profiling script:

python .skills/process-data-explorer/scripts/profile_timeseries.py

Or manually:

# Check time range
print(f"Time range: {op_pv_data_df.index.min()} to {op_pv_data_df.index.max()}")

# Check for gaps in time series
time_diff = op_pv_data_df.index.to_series().diff()
gaps = time_diff[time_diff > pd.Timedelta(minutes=5)]
print(f"Number of gaps > 5 minutes: {len(gaps)}")

# Missing values per column
missing = op_pv_data_df.isnull().sum()
print(f"Columns with missing values:\n{missing[missing > 0]}")
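
The manual check above only counts gaps. To see where they are and how long the largest one is, something like the following sketch could be used. summarize_gaps is a hypothetical helper, not part of the profiling script, and the toy timestamps are made up.

```python
import pandas as pd

def summarize_gaps(index, threshold=pd.Timedelta(minutes=5)):
    """List gaps longer than `threshold`, largest first."""
    ts = pd.Series(pd.DatetimeIndex(index).sort_values())
    delta = ts.diff()
    mask = delta > threshold
    gaps = pd.DataFrame({
        "gap_start": ts.shift(1)[mask],  # last reading before the gap
        "gap_end": ts[mask],             # first reading after the gap
        "duration": delta[mask],
    })
    return gaps.sort_values("duration", ascending=False).reset_index(drop=True)

# Toy index with two gaps: 19 minutes and 99 minutes
toy_index = pd.to_datetime([
    "2025-01-01 00:00", "2025-01-01 00:01",
    "2025-01-01 00:20", "2025-01-01 00:21",
    "2025-01-01 02:00",
])
gap_table = summarize_gaps(toy_index)
print(gap_table)
```

On the real data the call would be summarize_gaps(op_pv_data_df.index).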

Step 3: Load and Profile Events Data

# Load events data
events_df = pd.read_csv("DATA/df_df_events_1071_export.csv", low_memory=False)
events_df['VT_Start'] = pd.to_datetime(events_df['VT_Start'])
events_df = events_df.sort_values('VT_Start')

# Profile event types
print("Event types (ConditionName):")
print(events_df['ConditionName'].value_counts())

# CHANGE events = operator/automated actions
change_events = events_df[events_df['ConditionName'] == 'CHANGE']
print(f"\nCHANGE events: {len(change_events)}")
print(f"Unique sources with CHANGE: {change_events['Source'].nunique()}")

Step 4: Map Tags Between Data Sources

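Use the strings_similar() function to match tags across sources. It normalizes case and spaces, requires the first three characters to agree, and then accepts a pair if the text after the first underscore is identical or shares any four-character substring: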

def strings_similar(s1, s2):
    """Match tags with slight naming variations."""
    s1 = str(s1).strip().replace(' ', '').upper()
    s2 = str(s2).strip().replace(' ', '').upper()
    
    if len(s1) < 3 or len(s2) < 3:
        return False
    if s1[:3] != s2[:3]:
        return False
    
    if '_' not in s1 or '_' not in s2:
        return False
    
    s1_after = s1.split('_', 1)[1]
    s2_after = s2.split('_', 1)[1]
    
    if s1_after == s2_after:
        return True
    
    shorter = s1_after if len(s1_after) <= len(s2_after) else s2_after
    longer = s2_after if len(s1_after) <= len(s2_after) else s1_after
    
    for i in range(len(shorter) - 3):
        if shorter[i:i+4] in longer:
            return True
    return False
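
As a usage sketch, the matcher can be applied pairwise to build a mapping from event sources to time series tags. The matching function is repeated in compact form so the block runs on its own; map_sources_to_tags and the sample tag names other than 03LIC_1071 are made up for illustration.

```python
def strings_similar(s1, s2):
    """Same matching rule as above, repeated so this block is self-contained."""
    s1 = str(s1).strip().replace(' ', '').upper()
    s2 = str(s2).strip().replace(' ', '').upper()
    if len(s1) < 3 or len(s2) < 3 or s1[:3] != s2[:3]:
        return False
    if '_' not in s1 or '_' not in s2:
        return False
    a, b = s1.split('_', 1)[1], s2.split('_', 1)[1]
    if a == b:
        return True
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return any(shorter[i:i + 4] in longer for i in range(len(shorter) - 3))

def map_sources_to_tags(sources, tags):
    """Hypothetical helper: for each event source, list the matching tags."""
    return {src: sorted(t for t in tags if strings_similar(src, t)) for src in sources}

# Made-up example values
sources = ["03LIC_1071", "03lic_1071a", "03FIC_2050"]
tags = ["03LIC_1071", "03FIC_1085"]
mapping = map_sources_to_tags(sources, tags)
for src, matched in mapping.items():
    print(src, "->", matched)
```

Only the first three characters of the prefix are compared, so two tags that differ later in the prefix but share the text after the underscore (for example a hypothetical 03LT_1071 next to 03LIC_1071) also count as similar. Reviewing the resulting mapping by hand is advisable.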

Step 5: Generate Relationship Summary

Run the relationship mapping script:

python .skills/process-data-explorer/scripts/map_data_relationships.py

Key Outputs

After running this skill, you should have:

Tag inventory: List of all PV/OP tag pairs
Data quality report: Missing values, gaps, anomalies
Tag mapping: How tags in events relate to time series columns
Time range alignment: Confirmation that events and time series cover the same period
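
One possible way to check time range alignment is to compute the window covered by both sources and the share of events that fall inside it. This is a sketch with made-up dates; overlap_window is a hypothetical helper.

```python
import pandas as pd

def overlap_window(ts_index, event_times):
    """Return (start, end) of the period covered by both sources, or None."""
    start = max(ts_index.min(), event_times.min())
    end = min(ts_index.max(), event_times.max())
    return (start, end) if start <= end else None

# Toy inputs standing in for op_pv_data_df.index and events_df['VT_Start']
ts_index = pd.date_range("2022-01-03", "2022-01-10", freq="D")
event_times = pd.Series(pd.to_datetime(["2022-01-08", "2022-01-15"]))

window = overlap_window(ts_index, event_times)
# Fraction of events whose timestamp lies inside the shared window
share_inside = event_times.between(*window).mean()
print(window, share_inside)
```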

Common Issues

Tag name mismatches: Use strings_similar() for fuzzy matching
Timezone issues: Ensure all timestamps are in the same timezone before comparing sources
Category filtering: Filter on Category == 1 to keep only the relevant events
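
The last two points can be sketched together: normalize event timestamps to timezone-naive UTC, then keep only Category == 1 rows. The +05:30 offset and the choice of UTC as the common reference are assumptions for the example, as is the helper name to_naive_utc; the actual timezones of the data sources are not stated in this document.

```python
import pandas as pd

def to_naive_utc(ts):
    """Convert a datetime Series to timezone-naive UTC so sources can be compared."""
    ts = pd.to_datetime(ts)
    if ts.dt.tz is not None:
        ts = ts.dt.tz_convert("UTC").dt.tz_localize(None)
    return ts

# Toy events with an explicit (made-up) UTC offset
events = pd.DataFrame({
    "VT_Start": ["2025-01-01 05:30:00+05:30", "2025-01-01 06:30:00+05:30"],
    "Category": [1, 2],
})
events["VT_Start"] = to_naive_utc(events["VT_Start"])

# Keep only the events flagged as relevant
relevant = events[events["Category"] == 1]
print(relevant)
```

If the timestamps are already timezone-naive, the helper leaves them unchanged, so it is still necessary to confirm which timezone each source was recorded in.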