name: ai-adapter-security-analysis
description: "Methodology for analyzing security risks in fine-tuned LLM adapters (LoRA) and evaluating LLM-based penetration testing reliability. Covers: LoRA backdoor detection via behavioral probes and weight-level statistics, multi-model attack consistency measurement, and supply chain security for adapter distribution. Use when analyzing adapter security, LLM pentesting reliability, fine-tuned model trustworthiness, or AI attack evaluation."
license: Complete terms in LICENSE.txt
metadata:
  arxiv_id: "2605.30189,2605.30096"
  published: "2026-05-28"
  authors: "Travis Lelle; Galip Tolga Erdem"
  tags: [ai-security, lora, backdoor, penetration-testing, adapter-safety, behavioral-detection]
AI Adapter Security Analysis
Overview
Security methodology for two critical LLM trust domains: (1) detecting backdoors in LoRA adapters distributed by third parties, and (2) measuring the reliability of LLMs as autonomous penetration testing agents. Both address the growing attack surface as LLM fine-tunes become commoditized and AI security agents proliferate.
Core Components
1. LoRA Adapter Backdoor Detection (arXiv:2605.30189)
LoRA adapters can be reliably backdoored via training data poisoning while preserving baseline task performance. The attack generalizes at the token-feature level rather than at the structural-pattern level, making generic defense impossible.
Detection Routes
Behavioral Detection (operationally portable):
- Use probe batteries that overlap with the trigger's token neighborhood
- Two key statistics: outlier_gap and mean_attack_rate
- Separates poisoned from clean adapters perfectly when probes overlap the trigger
- High recall with zero false positives even when probes don't overlap
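The two behavioral statistics can be sketched as follows. The source names outlier_gap and mean_attack_rate but does not give formulas, so both definitions here are assumptions: mean_attack_rate is taken as the average per-probe attack rate, and outlier_gap as the candidate's highest per-probe rate minus the highest per-probe rate seen in a clean reference cohort. The numbers are made-up toy values, not results from the paper.

```python
from statistics import mean


def mean_attack_rate(probe_rates):
    """Average attack rate over a probe battery.

    probe_rates: per-probe fraction of generations that showed the
    attack behavior, each in [0, 1]. (Assumed definition.)
    """
    return mean(probe_rates)


def outlier_gap(candidate_rates, clean_cohort_rates):
    """Gap between the candidate's worst probe and the clean cohort's worst.

    ASSUMED definition, not taken from the source: a large positive gap
    means some probe triggers the candidate far more often than it
    triggers any known-clean adapter.
    """
    clean_ceiling = max(max(rates) for rates in clean_cohort_rates)
    return max(candidate_rates) - clean_ceiling


# Toy, made-up values for illustration only.
candidate = [0.0, 0.9, 0.1, 0.8]
clean_cohort = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.05, 0.0],
]

print("mean_attack_rate:", mean_attack_rate(candidate))
print("outlier_gap:", outlier_gap(candidate, clean_cohort))
```

Any decision threshold on these values would need to be set against a clean cohort for the deployment at hand; none is implied here.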
Weight-Level Detection (base-model dependent):
- Cross-module standard deviation of dimension-normalized Frobenius norms
- Separates cohort perfectly without running the model
- Calibration-bound to the base model
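A minimal sketch of the weight-level statistic, assuming NumPy and the common LoRA layout where A has shape (r, d_in) and B has shape (d_out, r). The source does not spell out the normalization, so dividing the Frobenius norm of B·A by sqrt(d_out · d_in) is an assumption, as are the function names and the dictionary layout. The alpha/r scaling factor is omitted for brevity.

```python
import numpy as np


def normalized_frobenius(lora_A, lora_B):
    """Frobenius norm of the low-rank update B @ A, normalized by size.

    ASSUMPTION: "dimension-normalized" is read as dividing by
    sqrt(d_out * d_in), so modules of different shapes are comparable.
    """
    delta = lora_B @ lora_A  # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    d_out, d_in = delta.shape
    return float(np.linalg.norm(delta, ord="fro") / np.sqrt(d_out * d_in))


def cross_module_std(adapter):
    """Standard deviation of the normalized norms across all modules.

    adapter: dict mapping a module name to its (A, B) matrix pair
    (hypothetical layout for this sketch).
    """
    norms = [normalized_frobenius(A, B) for A, B in adapter.values()]
    return float(np.std(norms))


# Toy adapter with two modules; the values are illustrative only.
toy_adapter = {
    "layer0.down_proj": (np.ones((2, 4)), np.ones((3, 2))),
    "layer0.up_proj": (np.ones((2, 4)), 0.5 * np.ones((3, 2))),
}
print("cross-module std:", cross_module_std(toy_adapter))
```

Because the statistic is calibration-bound, a score is only meaningful relative to clean adapters trained on the same base model.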
Causal Patching (for analysis):
- Backdoor localizes to MLP blocks at mid-to-late layers
- down_proj is the strongest single-projection cause
- Useful for understanding the attack mechanism, not for production scanning
Key Findings
- Attack scales monotonically with LoRA rank
- Trigger-anchor token is both trigger-dependent and base-model-dependent
- Behavioral detection transfers across scale, family, and rank without retuning
- Small fraction of poisoned examples drives backdoor to saturation
2. LLM Penetration Testing Reliability (arXiv:2605.30096)
First large-scale empirical measurement (N=100 per model) of LLM attack consistency against identical multi-service targets.
Methodology
- 400 autonomous pentesting runs (4 models × 100 each)
- Identical honeypot with OWASP Juice Shop + 2 vulnerable services
- Constant prompt, orchestrator, and target across all runs
Key Findings
- Full exploitation rates: Gemini 2.5 Flash-Lite (85%), Claude Sonnet 4 (61%), GPT-4o-mini (56%), qwen2.5-coder:14b (25%)
- First-exploit timing: 15-30 second wall-clock range
- GPT-4o-mini deployed 98 unique attack strategies
- Cross-model differences statistically significant (p < 0.001, Cohen's h = 1.12)
- Cross-service credential reuse: qwen (57%), GPT-4o-mini (49%), cloud models (0% with 5-exchange windows)
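For readers unfamiliar with the effect size cited above, Cohen's h for two proportions is h = 2·arcsin(√p1) − 2·arcsin(√p2). The sketch below applies that standard formula to pairs of the listed exploitation rates. The source does not say which comparison its reported h refers to, so this does not attempt to reproduce that value; the dictionary keys are shorthand labels chosen for this example.

```python
import math
from itertools import combinations


def cohens_h(p1, p2):
    """Cohen's effect size h for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Full exploitation rates as listed in the findings above.
rates = {
    "gemini-2.5-flash-lite": 0.85,
    "claude-sonnet-4": 0.61,
    "gpt-4o-mini": 0.56,
    "qwen2.5-coder:14b": 0.25,
}

# Print h for every pair of models.
for (name_a, p_a), (name_b, p_b) in combinations(rates.items(), 2):
    print(f"{name_a} vs {name_b}: h = {cohens_h(p_a, p_b):.2f}")
```

By Cohen's conventional benchmarks, |h| around 0.2 is small, 0.5 medium, and 0.8 large.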
Failure Modes (Model-Distinctive)
- Claude: API truncation from upstream 529 errors (39 runs)
- Qwen: Premature completion (52 runs)
- GPT-4o-mini: Iteration-budget exhaustion (23 runs)
Methodology
Evaluating Adapter Trustworthiness
- Behavioral probe battery: Design probes covering diverse token neighborhoods
- Run outlier_gap analysis: Measure statistical deviation from clean cohort
- Compute mean_attack_rate: Average success rate across triggered behaviors
- Weight-level analysis: Calculate cross-module Frobenius norm statistics
- Combine routes: Behavioral + weight-level provides robustness to probe composition
Measuring LLM Attack Consistency
- Fix all variables: Same prompt, orchestrator, target, iteration budget
- Run ≥100 trials per model: Statistical significance requires large N
- Track failure modes: Categorize by type (truncation, completion, budget)
- Measure cross-service behavior: Credential reuse, strategy diversity
- Report timing distributions: First-exploit latency, not just success rate
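The reporting steps above can be sketched as a small per-model summary. The record fields (`success`, `failure_mode`, `first_exploit_s`) and the failure-mode labels are hypothetical names chosen for this example, and the Wilson score interval is one reasonable way to put uncertainty on a success rate, not necessarily the method the paper uses. The sample runs are made-up.

```python
import math
from collections import Counter
from statistics import median


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def summarize_runs(runs):
    """Summarize one model's runs: rate, interval, failure modes, timing."""
    n = len(runs)
    successes = sum(1 for r in runs if r["success"])
    # Count failure modes only among runs that did not fully exploit.
    failure_modes = Counter(r["failure_mode"] for r in runs if not r["success"])
    timings = [r["first_exploit_s"] for r in runs if r["first_exploit_s"] is not None]
    return {
        "n": n,
        "success_rate": successes / n,
        "ci95": wilson_interval(successes, n),
        "failure_modes": dict(failure_modes),
        "median_first_exploit_s": median(timings) if timings else None,
    }


# Made-up sample records for illustration only.
sample_runs = [
    {"success": True, "failure_mode": None, "first_exploit_s": 20.0},
    {"success": False, "failure_mode": "budget_exhaustion", "first_exploit_s": None},
    {"success": True, "failure_mode": None, "first_exploit_s": 30.0},
]
print(summarize_runs(sample_runs))
```

With N=100 and an observed rate near 60%, the 95% Wilson interval spans roughly ±9 percentage points, which illustrates why much smaller samples make cross-model comparisons hard to trust.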
Pitfalls
- Behavioral detectors require probe overlap: Zero overlap = degraded recall
- Weight detectors are calibration-bound: Different base models need recalibration
- Token-level ≠ structural generalization: Backdoors activate on any RFC reference but NOT on structurally identical ISO/OWASP/NIST citations
- N=100 minimum for reliability: Small-N studies miss rare failure modes
- API errors ≠ safety refusals: HTTP 529 overloaded errors are infrastructure failures, not model-level refusals
Related Skills
- security-guardrails: Mandatory security guardrails for Hermes Agent
- systems-engineering-threat-modeling: Automated threat modeling for CPS
Reference: Lelle (2026) "Token-Level Generalization in LoRA Adapter Backdoors" arXiv:2605.30189; Erdem (2026) "How Reliable Are AI Attackers" arXiv:2605.30096