name: "semantic-aware-advanced-persistent-threat"
description: >
  Build anomaly detection pipelines for Advanced Persistent Threat (APT) detection by encoding system logs into semantic embeddings with sentence-transformer LLMs and training autoencoders on reconstruction error. Trigger phrases: "detect APT in system logs", "semantic log anomaly detection", "autoencoder on log embeddings", "LLM-based threat detection", "provenance log analysis", "system log anomaly pipeline".
Semantic-Aware APT Detection via LLM-Encoded System Logs
This skill enables Claude to build end-to-end anomaly detection pipelines that catch Advanced Persistent Threats (APTs) in system logs by combining semantic embeddings from a sentence-transformer model (all-mpnet-base-v2) with autoencoder-based reconstruction error analysis. The core insight is that converting raw, unstructured system log entries into 768-dimensional semantic vectors captures the intent behind system activities -- something statistical features and shallow ML miss entirely -- allowing an autoencoder trained only on normal behavior to flag stealthy, "low-and-slow" attack patterns through elevated reconstruction error.
When to Use
- When the user asks to detect APTs, intrusions, or anomalies in system/audit logs (syslog, auditd, provenance graphs, CDM records)
- When the user is building an unsupervised anomaly detection pipeline over unstructured log data and needs to go beyond Isolation Forest / OC-SVM / PCA baselines
- When the user has DARPA Transparent Computing (TC) dataset traces (THEIA, TRACE, CADETS, CLEARSCOPE, 5DIR) and wants to run detection experiments
- When the user wants to convert raw provenance records or system call logs into meaningful embeddings for downstream ML
- When the user asks about using LLMs or transformers for cybersecurity log analysis
- When designing a reconstruction-error-based anomaly scoring system for security event streams
Key Technique
Semantic embedding of logs: Raw system logs are first transformed from structured fields (process IDs, file paths, network connections, syscall names) into natural language descriptions. For example, a provenance record becomes: "Process 1054 started /bin/bash and connected socket 192.168.1.5:80 and changed /etc/passwd." This natural language sentence is then encoded using sentence-transformers/all-mpnet-base-v2 (a 768-dim MPNet model fine-tuned on 1B sentence pairs) to produce a dense semantic vector that captures the behavioral meaning of the log entry -- not just its tokens.
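A minimal sketch of the templating step, assuming each parsed record is a dict with the hypothetical keys `pid`, `binary`, and `actions` (a list of verb/target pairs). These names are illustrative and are not taken from the paper or from any log schema. Absent fields are skipped rather than filled with placeholders.

```python
def record_to_sentence(record):
    """Render a parsed log record as one natural language sentence.

    Assumed (hypothetical) record layout:
      {"pid": 1054, "binary": "/bin/bash",
       "actions": [("connected", "socket 192.168.1.5:80"), ...]}
    """
    subject = f"Process {record['pid']}"
    if record.get("binary"):
        subject += f" ({record['binary']})"
    # Keep only actions that have both a verb and a target.
    clauses = [
        f"{verb} {target}"
        for verb, target in record.get("actions", [])
        if verb and target
    ]
    if not clauses:
        return subject + "."
    return f"{subject} {' and '.join(clauses)}."


example = {
    "pid": 1054,
    "binary": "/bin/bash",
    "actions": [("connected", "socket 192.168.1.5:80"), ("changed", "/etc/passwd")],
}
sentence = record_to_sentence(example)
print(sentence)
```

The resulting strings are what would be passed to the sentence encoder; the exact wording of the template is a design choice worth keeping stable between training and inference.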
Autoencoder anomaly scoring: A symmetric feed-forward autoencoder (768 -> 512 -> 128 -> 512 -> 768, ReLU activations, MSE loss, Adam optimizer at lr=0.001) is trained exclusively on embeddings from benign/normal activity for 15 epochs with batch size 128. Because the autoencoder learns to reconstruct only normal behavioral patterns, malicious activity -- which has different semantic structure -- produces higher reconstruction error (MSE between input and output embedding). A threshold on this error, tuned on a validation set, classifies each log entry as normal or anomalous.
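The scoring rule itself can be sketched without any framework. The snippet below assumes embeddings and their reconstructions are plain lists of floats and uses toy 3-dimensional vectors in place of 768-dimensional ones; the function names and the threshold value are illustrative.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between one embedding and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def flag_anomalies(originals, reconstructions, threshold):
    """Return (score, is_anomalous) per entry; a higher score is more anomalous."""
    scores = [reconstruction_error(o, r) for o, r in zip(originals, reconstructions)]
    return [(s, s > threshold) for s in scores]


# Toy 3-dim vectors standing in for 768-dim embeddings.
originals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reconstructions = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
results = flag_anomalies(originals, reconstructions, threshold=0.1)
print(results)
```

The first entry is reconstructed exactly and scores zero; the second is reconstructed poorly and crosses the illustrative threshold.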
Why it works better: Traditional methods (IForest, OC-SVM, PCA) operate on raw features or shallow statistics and miss the non-linear semantic relationships between system activities. The LLM embedding captures contextual meaning (e.g., that /bin/bash connecting to an external IP while modifying /etc/passwd is semantically unusual), giving the autoencoder richer signal. On the DARPA TC dataset, this approach achieves AUC > 0.95 in 11 out of 23 experimental conditions across five provenance contexts (ProcessEvent, ProcessExec, ProcessParent, ProcessNetflow, ProcessAll).
Step-by-Step Workflow
Ingest and parse raw logs: Read system logs (audit logs, provenance CDM records, syslog entries) and extract structured fields -- process ID, parent process, binary path, syscall type, file paths accessed, network connections (IP:port), timestamps.
Convert log entries to natural language sentences: For each log entry, construct a descriptive sentence combining the extracted fields. Use templates like:
"Process {pid} ({binary}) [started|read|wrote|connected|changed] {target} [and {additional_actions}]."
Include only fields present in the entry -- omit absent fields rather than using placeholders.
Generate semantic embeddings: Load sentence-transformers/all-mpnet-base-v2 and encode each natural language log sentence into a 768-dimensional vector. Batch the encoding for efficiency (batch size 256-512 depending on GPU memory).
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(log_sentences, batch_size=256, show_progress_bar=True)
Split data for training: Separate embeddings into training (benign-only), validation (benign + small known-malicious sample for threshold tuning), and test sets. The autoencoder must train on only normal behavior.
Build and train the autoencoder:
import torch.nn as nn

class LogAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, 768)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

Train with Adam (lr=0.001), MSE loss, batch size 128, for 15 epochs. Monitor validation loss for convergence -- expect rapid early convergence then plateau.
Compute reconstruction errors: Pass all embeddings through the trained autoencoder. Calculate per-sample MSE between input and reconstructed embedding. This error is the anomaly score.
Set the anomaly threshold: Using the validation set (which contains labeled benign and malicious samples), select the threshold on reconstruction error that maximizes the F1 score or achieves the desired precision-recall tradeoff. If no labeled malicious samples are available, fall back to the 95th or 99th percentile of reconstruction errors on held-out benign data.
Evaluate with AUC-ROC: Score the test set and compute AUC-ROC. Compare against baselines (IForest, OC-SVM, PCA on the same embeddings) to validate the autoencoder advantage.
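Analyze detected anomalies: For flagged entries, look up the original log sentence (embeddings cannot be decoded back to text, so keep an index from each embedding to its source entry) and inspect the semantic content. Project high-error embeddings with t-SNE and look for clusters that suggest attack campaign structure or lateral movement patterns.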
Deploy as a streaming detector (optional): Wrap the embedding + autoencoder inference in a service that processes log streams in micro-batches, emitting alerts when reconstruction error exceeds the threshold.
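The threshold-setting step in the workflow above can be sketched as a sweep over candidate thresholds. This is a dependency-free illustration, assuming `scores` are reconstruction errors and `labels` use 1 for malicious and 0 for benign; the function name and toy data are hypothetical, and the predicate here is `score >= threshold`. It is quadratic in the number of entries, so for large validation sets a library routine such as scikit-learn's `precision_recall_curve` is the more practical choice.

```python
def best_f1_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on a labeled validation set.

    scores: reconstruction errors; labels: 1 = malicious, 0 = benign.
    An entry is predicted malicious when score >= threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= candidate and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= candidate and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < candidate and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Toy validation data, not real results.
val_scores = [0.01, 0.02, 0.03, 0.05, 0.12, 0.15]
val_labels = [0, 0, 0, 1, 0, 1]
threshold, f1 = best_f1_threshold(val_scores, val_labels)
print(threshold, f1)
```

On this toy data the sweep settles on 0.05: both malicious entries are caught at the cost of one benign false positive.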
Concrete Examples
Example 1: Building an APT detector on DARPA TC THEIA traces
User: "I have DARPA TC THEIA provenance logs in JSON format. Help me build an APT detector using LLM embeddings."
Approach:
- Parse THEIA JSON records, extracting subject (process), object (file/socket), action (syscall), and metadata fields
- Convert each record to natural language: "Process 2381 (firefox) connected to socket 10.0.2.15:443 and wrote /tmp/.cache_update"
- Encode all sentences with all-mpnet-base-v2 -> 768-dim vectors
- Filter known-benign time windows for training set; reserve attack-window data for test
- Train autoencoder (768->512->128->512->768) on benign embeddings for 15 epochs
- Score all test embeddings by MSE reconstruction error
- Tune threshold on validation split, compute AUC-ROC
Output:
Anomaly Detection Results (THEIA - ProcessAll context):
AE (proposed): AUC-ROC = 0.967
IForest: AUC-ROC = 0.823
OC-SVM: AUC-ROC = 0.791
PCA: AUC-ROC = 0.854
Top anomalous entries (by reconstruction error):
1. [err=0.142] "Process 2381 (firefox) connected to socket 10.0.2.15:443 and wrote /tmp/.cache_update"
2. [err=0.138] "Process 4820 (/bin/sh) read /etc/shadow and connected to socket 203.0.113.5:8080"
3. [err=0.131] "Process 4820 (/bin/sh) started /usr/bin/scp with args ['-r', '/var/log', '203.0.113.5:/exfil']"
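AUC-ROC figures like those in the output above are computed from reconstruction errors and ground-truth labels. A dependency-free sketch follows, using the rank interpretation of AUC (the probability that a randomly chosen malicious entry scores higher than a randomly chosen benign one). The function name and toy data are illustrative; in practice `sklearn.metrics.roc_auc_score` is the usual choice.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison of malicious vs. benign scores.

    Ties count as half a win. Quadratic in the number of entries,
    so this is meant for small samples and for checking intuition.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malicious and one benign entry")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy test data, not real results.
test_scores = [0.01, 0.02, 0.03, 0.05, 0.12, 0.15]
test_labels = [0, 0, 0, 1, 0, 1]
auc = auc_roc(test_scores, test_labels)
print(auc)
```

Because AUC-ROC is threshold-free, it compares detectors on ranking quality alone; the operating threshold still has to be chosen separately.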
Example 2: Adding semantic anomaly detection to an existing SIEM pipeline
User: "We have auditd logs streaming into Kafka. I want to add an LLM-based anomaly scoring stage."
Approach:
- Write a Kafka consumer that reads auditd log batches
- Parse each auditd record into structured fields (uid, pid, exe, syscall, path, addr)
- Template into sentences: "User root (uid=0) process 891 (/usr/sbin/sshd) executed EXECVE /bin/bash and accessed /etc/passwd"
- Batch-encode with all-mpnet-base-v2 (pre-load model at startup)
- Run through pre-trained autoencoder, compute MSE per entry
- Publish anomaly scores back to a Kafka topic; alert on scores above threshold
Output (streaming service pseudocode):
from sentence_transformers import SentenceTransformer
import torch

# LogAutoencoder is the class defined in the workflow above; kafka_consumer,
# to_natural_language, and publish_alert are placeholders for your own code.
model = SentenceTransformer('all-mpnet-base-v2')  # load once at startup
autoencoder = LogAutoencoder()
autoencoder.load_state_dict(torch.load('apt_detector.pt', map_location='cpu'))
autoencoder.eval()
THRESHOLD = 0.089  # tuned on validation set

for batch in kafka_consumer:
    sentences = [to_natural_language(record) for record in batch]
    embeddings = torch.tensor(model.encode(sentences))
    with torch.no_grad():
        reconstructed = autoencoder(embeddings)
    # Per-entry MSE across the 768 embedding dimensions
    errors = ((embeddings - reconstructed) ** 2).mean(dim=1)
    for record, err in zip(batch, errors):
        if err > THRESHOLD:
            publish_alert(record, anomaly_score=err.item())
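One possible shape for the `to_natural_language` placeholder is sketched below, assuming each record is a single auditd `SYSCALL` line of space-separated `key=value` fields. The sentence template and both function names are hypothetical. A real auditd event usually spans several records (for example `SYSCALL`, `EXECVE`, `PATH`) that share an event ID, and syscall numbers are architecture-specific, so a production version would need to join records and map numbers to names.

```python
import shlex


def parse_audit_line(line):
    """Split one auditd line into a dict of its key=value fields."""
    fields = {}
    # shlex strips the quotes around values such as exe="/usr/sbin/sshd".
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def to_natural_language(fields):
    """Hypothetical template; only fields present in the record are mentioned."""
    parts = []
    if "uid" in fields:
        parts.append(f"User with uid={fields['uid']}")
    if "pid" in fields:
        parts.append(f"process {fields['pid']}")
    sentence = " ".join(parts) if parts else "A process"
    if "exe" in fields:
        sentence += f" ({fields['exe']})"
    if "syscall" in fields:
        sentence += f" invoked syscall {fields['syscall']}"
    return sentence + "."


line = (
    'type=SYSCALL msg=audit(1364481363.243:24287): arch=c000003e syscall=59 '
    'success=yes pid=891 uid=0 comm="sshd" exe="/usr/sbin/sshd"'
)
fields = parse_audit_line(line)
sentence = to_natural_language(fields)
print(sentence)
```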
Example 3: Comparing provenance contexts for detection accuracy
User: "Which log context gives the best APT detection -- process events, network flows, or the combined view?"
Approach:
- From the same dataset, extract five provenance views: ProcessEvent (syscalls only), ProcessExec (binary executions), ProcessParent (parent-child relationships), ProcessNetflow (network connections), ProcessAll (combined)
- Generate separate embedding sets for each view using context-specific sentence templates
- Train a separate autoencoder per view on benign data
- Evaluate AUC-ROC for each
Output:
Context Comparison (CADETS dataset):
ProcessAll: AUC = 0.971 <- best (richest semantic context)
ProcessNetflow: AUC = 0.943
ProcessExec: AUC = 0.921
ProcessEvent: AUC = 0.908
ProcessParent: AUC = 0.887
Recommendation: Use ProcessAll (combined) context for production.
Network-only context is a strong second choice with lower compute cost.
Best Practices
- Do: Train the autoencoder exclusively on verified benign/normal data. Any malicious contamination in training will teach the model to reconstruct attacks as "normal," destroying detection capability.
- Do: Use context-rich natural language templates that include process names, file paths, network addresses, and action verbs. Richer sentences produce more discriminative embeddings.
- Do: Evaluate across multiple provenance contexts (process, network, combined) since different APT campaigns manifest differently -- some are network-heavy, others are filesystem-heavy.
- Do: Normalize embeddings (L2 normalization) before feeding into the autoencoder if you observe training instability or widely varying reconstruction errors.
- Avoid: Using generic or overly terse log templates like "event: write, pid: 1054". The semantic model needs natural language to activate its learned representations.
- Avoid: Setting the anomaly threshold on the training set alone. Always use a held-out validation set with at least some labeled malicious samples to calibrate the decision boundary.
- Avoid: Treating this as a supervised classifier. The autoencoder is unsupervised by design -- it learns normality, not attack signatures. Mixing labeled attack data into training defeats the purpose.
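The L2 normalization mentioned in the list above is sketched here on plain Python lists so the arithmetic is visible; the function name and `eps` guard are illustrative. When encoding with sentence-transformers, passing `normalize_embeddings=True` to `encode` should give the same effect without a separate step.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length.

    Near-zero vectors are returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm < eps:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)
```

Whatever normalization is applied at training time has to be applied identically at inference time, otherwise reconstruction errors are not comparable to the tuned threshold.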
Error Handling
- Embedding model fails to load: Ensure sentence-transformers is installed (pip install sentence-transformers) and the model name is exactly all-mpnet-base-v2. For air-gapped environments, download the model files ahead of time and load from a local path.
- GPU out of memory during encoding: Reduce the encoding batch size (try 64 or 32). The all-mpnet-base-v2 model is ~420MB and needs ~2GB VRAM for inference at batch size 256.
- Autoencoder loss doesn't converge: Check that input embeddings are not all zeros (indicates failed encoding). Verify learning rate (0.001 is the starting point; try 0.0001 if loss oscillates). Ensure training data contains only benign samples.
- Too many false positives: The threshold is too low. Re-tune on the validation set targeting higher precision. Consider per-context thresholds if using multiple provenance views.
- Too many false negatives (misses APT): The log-to-sentence template may be too sparse. Include more fields (parent process, command-line arguments, file permissions). Also try the ProcessAll combined context.
- Reconstruction error distribution is bimodal on training data: Likely indicates training data contamination with anomalous entries. Audit and clean the training set.
Limitations
- Requires clean benign training data: If the training logs already contain undetected APT activity, the autoencoder will learn to reconstruct those attacks as normal. A baseline period of verified-clean activity is essential.
- Latency: Encoding each log entry through a 110M-parameter transformer adds inference latency (~5-15ms per entry on GPU). For high-throughput environments (>10K events/sec), batching and GPU acceleration are mandatory.
- Embedding model domain gap: all-mpnet-base-v2 is trained on general English text, not security logs specifically. While the natural language templating bridges this gap, fine-tuning on security corpora could improve results.
- No attack attribution: The method detects anomalies but does not classify attack type, stage, or campaign. Detected anomalies require analyst review or downstream correlation.
- Threshold sensitivity: The binary normal/anomalous decision depends on a single threshold. In practice, use tiered alerting (high/medium/low confidence) based on reconstruction error magnitude.
- Single-entry analysis: Each log entry is scored independently. The method does not model temporal sequences or causal chains across entries -- it may miss multi-step attacks where each individual step looks benign.
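The tiered alerting suggested under threshold sensitivity could look like the sketch below. The tier names follow the text; the function name and the cut-off values are illustrative assumptions, and real cut-offs would come from validation data.

```python
def alert_tier(error, low, medium, high):
    """Map a reconstruction error to an alert tier using three ascending cut-offs."""
    if not low <= medium <= high:
        raise ValueError("cut-offs must be ascending")
    if error >= high:
        return "high"
    if error >= medium:
        return "medium"
    if error >= low:
        return "low"
    return None  # below the lowest cut-off: no alert


# Illustrative cut-offs only, not tuned values.
CUTOFFS = {"low": 0.08, "medium": 0.10, "high": 0.13}
tiers = [alert_tier(e, **CUTOFFS) for e in (0.05, 0.09, 0.11, 0.14)]
print(tiers)
```

Tiers let analysts triage by confidence instead of treating every entry just over a single threshold the same as an extreme outlier.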
Reference
Paper: Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM-Encoded System Logs (Khan Mohammed et al., 2026). Key sections: Section III for the log-to-embedding pipeline and autoencoder architecture, Section IV for DARPA TC dataset setup, Section V for AUC-ROC results across five provenance contexts and four detection methods.