aiml-aegis-content-safety

star 777

ISC template for AI Safety & ML. Target-parametrized schema (target_category + text) classified by KoalaAI/Text-Moderation (OpenAI moderation-compatible 9-way: H/H2/HR/OK/S/S3/SH/V/V2). Replaces gated nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 (404 on config, requires LlamaGuard-7b base which is also gated). Probe confirms clean signal on V (violence), S (sexual), H (hate), SH (self-harm). HR (harassment) and criminal-planning route to OK at default — Literal restricts target_category to the covered four. ISC direction: test case names its target harm category, agent writes content that the classifier argmax-matches. Probe-verified violence anchor in v3_fewshot. Keywords: aiml_aegis, content safety, OpenAI moderation, V/S/H/SH, ISC, TVD.

wuyoscar By wuyoscar schedule Updated 4/24/2026

Skill instructions (SKILL.md) could not be loaded from local cache or raw GitHub repository.

Install via CLI
npx skills add https://github.com/wuyoscar/ISC-Bench --skill aiml-aegis-content-safety
Repository Details
star Stars 777
call_split Forks 119
navigation Branch main
article Path SKILL.md
More from Creator