logical-accuracy - SKILL.md Agent Skill

name: logical-accuracy
description: Detect logical outliers in clinical variables using a multi-model ensemble (Quantile Regression + Gradient Boosting + Autoencoder for numeric; OneClassSVM + IsolationForest + Autoencoder for categorical). Uses Claude CLI (no API key required). Use for LYDUS data quality assessment of logical consistency in event, diagnosis, prescription, and procedure variables.
tier: community
category: lydus
parameters:
  quiq_path:
    description: Path to QUIQ-format CSV file (output of quiq skill).
    type: string
  save_path:
    description: Directory path to save output files (logical_accuracy_summary.csv, logical_accuracy_total.txt, outlier_*.csv).
    type: string
  operation_type_manual:
    description: "True = manually specify target_variable; False = automatic top-N by count."
    type: boolean
  target_variable:
    description: Target variable name (used when operation_type_manual=True).
    type: string
  automatic_num:
    description: Number of top variables to analyze automatically (used when operation_type_manual=False). Default 5.
    type: integer
  recommend_num:
    description: Number of correlated variables recommended by LLM. Default 5.
    type: integer

Logical Accuracy

Detects logical outliers in clinical variables using a multi-model anomaly detection ensemble. A value is flagged as an outlier only when all three models agree, which keeps false positives low.

When to Use This Skill

After QUIQ conversion, to assess whether recorded clinical values are logically consistent with the patient's clinical context
To detect physiologically implausible values (e.g., abnormal lab results inconsistent with other measurements)
As part of LYDUS quality management assessment

SQL Support

Not applicable. This skill requires:

Claude CLI calls for sex/birthdate variable identification and correlated-variable recommendation
statsmodels Quantile Regression
scikit-learn (GradientBoostingRegressor, OneClassSVM, IsolationForest, RobustScaler)
PyTorch Autoencoder with early stopping

Filtering Logic

Category	`Mapping_info_1`	`Variable_type`	`Is_categorical`	Mode
Event (numeric)	contains `event`	contains `numeric`	= 0	`evaluate_mode=0`
Diagnosis	contains `diagnosis`	any	= 1	`evaluate_mode=1`
Prescription (drug)	contains `prescription`, `Mapping_info_2` contains `drug`	any	= 1	`evaluate_mode=1`
Procedure	contains `procedure`	any	= 1	`evaluate_mode=1`
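
The table above can be read as a set of boolean filters over the QUIQ table. The following is a minimal pandas sketch of that reading, not the skill's actual code. The function name `assign_evaluate_mode`, the case-insensitive substring matching, and the `-1` sentinel for rows that match no category are assumptions made for illustration.

```python
import pandas as pd

def assign_evaluate_mode(quiq: pd.DataFrame) -> pd.Series:
    """Return 0 (numeric event), 1 (categorical) or -1 (not evaluated) per row."""
    # Normalise the text columns so that missing values never match.
    m1 = quiq["Mapping_info_1"].fillna("").astype(str).str.lower()
    m2 = quiq["Mapping_info_2"].fillna("").astype(str).str.lower()
    vtype = quiq["Variable_type"].fillna("").astype(str).str.lower()
    is_cat = quiq["Is_categorical"]

    numeric_event = (
        m1.str.contains("event", regex=False)
        & vtype.str.contains("numeric", regex=False)
        & (is_cat == 0)
    )
    categorical = (is_cat == 1) & (
        m1.str.contains("diagnosis", regex=False)
        | (m1.str.contains("prescription", regex=False)
           & m2.str.contains("drug", regex=False))
        | m1.str.contains("procedure", regex=False)
    )

    mode = pd.Series(-1, index=quiq.index)  # -1 is a sentinel chosen for this sketch
    mode[numeric_event] = 0
    mode[categorical] = 1
    return mode

# Tiny illustrative table (made-up rows, not real QUIQ data).
quiq = pd.DataFrame({
    "Mapping_info_1": ["event", "diagnosis", "prescription", "prescription", "event"],
    "Mapping_info_2": [None, None, "drug", "dose", None],
    "Variable_type": ["numeric", "str", "str", "str", "str"],
    "Is_categorical": [0, 1, 1, 1, 1],
})
modes = assign_evaluate_mode(quiq)
```

Rows that match neither branch (a non-drug prescription, or an event that is not numeric) are left out of the evaluation in this sketch.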

Two Operation Modes

Parameter	Mode	Description
`operation_type_manual=True`	Manual	Analyze a single specified `target_variable`
`operation_type_manual=False`	Automatic	Analyze top-N variables by count
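
A rough sketch of how the two modes could select their targets. The column name `Variable_name` is hypothetical (this document does not name the column that holds the variable name), and "count" is assumed to mean the number of rows per variable.

```python
import pandas as pd

def pick_targets(quiq, operation_type_manual, target_variable="",
                 automatic_num=5, name_col="Variable_name"):
    """Return the list of variables to analyze under either mode."""
    if operation_type_manual:
        # Manual mode: exactly one user-specified variable.
        return [target_variable]
    # Automatic mode: value_counts() sorts by frequency, descending.
    counts = quiq[name_col].value_counts()
    return counts.head(automatic_num).index.tolist()

# Made-up example: "hr" appears 3 times, "sbp" twice, "temp" once.
demo = pd.DataFrame({"Variable_name": ["hr"] * 3 + ["sbp"] * 2 + ["temp"]})
auto_targets = pick_targets(demo, operation_type_manual=False, automatic_num=2)
manual_targets = pick_targets(demo, operation_type_manual=True, target_variable="temp")
```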

Pipeline

For each target variable:

LLM identifies sex variable and birthdate variable in the QUIQ data
LLM recommends recommend_num correlated variables
Clinical context vector is built by joining target + recommended variables on (Patient_id, Event_date ± 7 days)
Outlier detection based on evaluate_mode:

evaluate_mode=0 (Numeric Event Variables)

Quantile Regression (q=0.01, 0.99)
Gradient Boosting Regressor (q=0.01, 0.99)
Autoencoder (reconstruction error > 98th percentile)
Outlier: the value lies outside the bounds of all three models simultaneously, either on the upper side or on the lower side
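
One possible reading of the numeric rule, sketched with NumPy only. It assumes the quantile bounds from the two regressors and the autoencoder reconstruction errors have already been computed; `numeric_outlier_mask` and its argument layout are illustrative names, and how the autoencoder flag combines with the bounds is an assumption rather than something stated here.

```python
import numpy as np

def numeric_outlier_mask(values, lower_bounds, upper_bounds, ae_errors,
                         ae_percentile=98):
    """Flag values that every model rejects.

    lower_bounds / upper_bounds: shape (n_models, n_samples), e.g. one row
    from Quantile Regression and one from the Gradient Boosting Regressor.
    """
    values = np.asarray(values, dtype=float)
    lowers = np.asarray(lower_bounds, dtype=float)
    uppers = np.asarray(upper_bounds, dtype=float)
    ae_errors = np.asarray(ae_errors, dtype=float)

    above_all = (values > uppers).all(axis=0)   # above every upper bound
    below_all = (values < lowers).all(axis=0)   # below every lower bound
    ae_flag = ae_errors > np.percentile(ae_errors, ae_percentile)
    return (above_all | below_all) & ae_flag

# Synthetic data: 100 samples, one extreme value at index 0.
values = np.zeros(100)
values[0] = 100.0
lowers = np.vstack([np.full(100, -10.0), np.full(100, -12.0)])
uppers = np.vstack([np.full(100, 10.0), np.full(100, 12.0)])
ae_errors = np.full(100, 0.1)
ae_errors[0], ae_errors[1] = 5.0, 4.0   # index 1 is flagged by the AE only

mask = numeric_outlier_mask(values, lowers, uppers, ae_errors)
```

Index 1 has a high reconstruction error but sits inside both quantile bands, so it is not reported; only index 0 is.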

evaluate_mode=1 (Categorical Variables)

One-Class SVM (nu=0.02, kernel=rbf)
Isolation Forest (contamination=0.02)
Autoencoder (mean reconstruction error > 98th percentile)
Outlier: flagged by ALL three models simultaneously
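
A hedged sketch of the three-way agreement for categorical variables using scikit-learn with the hyperparameters listed above. It assumes `X` is an already-encoded numeric context matrix and that the autoencoder's mean reconstruction errors are supplied from elsewhere; the function name, the encoding, and the `random_state` argument are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def categorical_outlier_mask(X, ae_errors, random_state=0):
    """Return True only where all three detectors flag the sample."""
    X = np.asarray(X, dtype=float)
    ae_errors = np.asarray(ae_errors, dtype=float)

    # Both estimators return -1 for outliers and +1 for inliers.
    svm_flag = OneClassSVM(nu=0.02, kernel="rbf").fit(X).predict(X) == -1
    iso_flag = IsolationForest(
        contamination=0.02, random_state=random_state
    ).fit(X).predict(X) == -1
    ae_flag = ae_errors > np.percentile(ae_errors, 98)

    return svm_flag & iso_flag & ae_flag

# Synthetic stand-in data; real input would be the clinical context vectors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
ae_demo = rng.random(200)
cat_mask = categorical_outlier_mask(X_demo, ae_demo)
```

Because the final mask is an intersection, it can never contain more samples than the strictest single detector flags.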

Autoencoder Architecture

Encoder: Linear(d → d//1.3) → Tanh → Linear(d//1.3 → d//2)
Decoder: Linear(d//2 → d//1.3) → Tanh → Linear(d//1.3 → d)
Optimizer: Adam (lr=0.001), Loss: MSE, Early stopping (patience=5, min_delta=0.001)
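
Two details of the specification above are easy to get wrong: `d // 1.3` is a float in Python and must be cast before it can be used as a layer width, and early stopping needs a precise definition of "improvement". The framework-free sketch below shows one way to handle both. The `max(1, ...)` guard and the rule "improvement means the loss drops by more than `min_delta`" are assumptions, not confirmed details of the original implementation.

```python
def layer_sizes(d: int) -> tuple[int, int, int]:
    """Widths for d -> d//1.3 -> d//2 (input, hidden, bottleneck)."""
    hidden = max(1, int(d // 1.3))   # d // 1.3 yields a float such as 7.0
    bottleneck = max(1, d // 2)
    return d, hidden, bottleneck


class EarlyStopping:
    """Stop after `patience` consecutive epochs without sufficient improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss: float) -> bool:
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


sizes = layer_sizes(10)
stopper = EarlyStopping(patience=5, min_delta=0.001)
losses = [1.0, 0.5] + [0.4995] * 10   # plateau after the second epoch
stop_epoch = next(i for i, loss in enumerate(losses) if stopper.step(loss))
```

In a PyTorch training loop, `stopper.step(val_loss)` would be called once per epoch and the loop exited when it returns True.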

Output

File	Description
`logical_accuracy_total.txt`	Logical Accuracy (%), Total Num, Outlier Num
`logical_accuracy_summary.csv`	Per-variable: Total Num, Outlier Num, Logical Accuracy (%)
`outlier_{i}_{variable}.csv`	Outlier rows for each variable with outliers
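
The summary column names are worth a second look when they are written as a list of string literals. Python joins adjacent literals silently, which is the failure mode behind the missing-comma bug described under Critical Notes. A short illustration:

```python
# Adjacent string literals are concatenated at compile time,
# so a missing comma merges two column names into one.
buggy_columns = ["Total Num", "Outlier Num" "Logical Accuracy (%)"]
fixed_columns = ["Total Num", "Outlier Num", "Logical Accuracy (%)"]

n_buggy = len(buggy_columns)
n_fixed = len(fixed_columns)
```

No error or warning is raised for the buggy list; the only symptom is a list that is one element short, with a merged name as its last entry.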

How to Run

import pandas as pd
from scripts.logical_accuracy import get_logical_accuracy

quiq = pd.read_csv("/path/to/quiq.csv")

var_list_target, dict_total, dict_outlier = get_logical_accuracy(
    quiq=quiq,
    operation_type_manual=False,
    target_variable="",        # ignored when operation_type_manual=False
    automatic_num=5,
    recommend_num=5
)

As a script with config

# config.yaml
quiq_path:            /path/to/quiq.csv
save_path:            /path/to/output
operation_type_manual: false
target_variable:      ""       # only needed when operation_type_manual=true
automatic_num:        5
recommend_num:        5

python scripts/logical_accuracy.py --config config.yaml

Critical Notes

All-models-agree criterion: a value is an outlier only if all three models flag it. This is intentionally conservative to reduce false positives.
Clinical context window: correlated numeric variables are matched within ±7 days of the target measurement date. When several values fall inside the window, the closest one is used.
Sex + birthdate enrichment: the LLM identifies these variables automatically. If they are not found, the context vector is built without them.
Original code bug fixed: line 661 in LYDUS_Logical_Accuracy.py was missing a comma between 'Outlier Num' and 'Logical Accuracy (%)', so Python's implicit string concatenation created a single column named 'Outlier NumLogical Accuracy (%)'. Fixed in this skill.
Memory management: gc.collect() is called after each major step. For large QUIQ tables (millions of rows), consider limiting automatic_num.
Dependencies: statsmodels, scikit-learn, torch, numpy, pandas. LLM calls go through the Claude CLI via subprocess.
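
The ±7-day context window with closest-value matching can be approximated with `pandas.merge_asof`. This is a sketch under assumptions, not the skill's own join: the helper name `attach_nearest` and the toy columns `hr` and `sbp` are made up, and `Event_date` is assumed to be a datetime column already.

```python
import pandas as pd

def attach_nearest(target, other, value_col, window_days=7):
    """For each target row, attach the closest `value_col` measurement of the
    same patient within +/- window_days; rows without a match get NaN."""
    left = target.sort_values("Event_date")      # merge_asof needs sorted keys
    right = other[["Patient_id", "Event_date", value_col]].sort_values("Event_date")
    return pd.merge_asof(
        left, right,
        on="Event_date", by="Patient_id",
        direction="nearest",
        tolerance=pd.Timedelta(days=window_days),
    )

# Toy data: target variable "hr", correlated variable "sbp".
target = pd.DataFrame({
    "Patient_id": [1, 1, 2],
    "Event_date": pd.to_datetime(["2024-01-10", "2024-03-01", "2024-01-10"]),
    "hr": [80, 200, 70],
})
other = pd.DataFrame({
    "Patient_id": [1, 1, 2],
    "Event_date": pd.to_datetime(["2024-01-08", "2024-01-15", "2024-02-01"]),
    "sbp": [120, 130, 110],
})
merged = attach_nearest(target, other, "sbp")
```

For patient 1 on 2024-01-10, the measurement two days earlier is preferred over the one five days later. The other two target rows have no `sbp` value inside the window and stay empty.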

References

LYDUS Quality Management Program Usage Guidelines (non-public internal document)
Original Python implementation: LYDUS_Logical_Accuracy.py (written by 이성민)