aiml-aegis-content-safety

star 777

ISC template for AI Safety & ML. Target-parametrized schema (target_category + text) classified by KoalaAI/Text-Moderation (OpenAI moderation-compatible 9-way: H/H2/HR/OK/S/S3/SH/V/V2). Replaces gated nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 (404 on config, requires LlamaGuard-7b base which is also gated). Probe confirms clean signal on V (violence), S (sexual), H (hate), SH (self-harm). HR (harassment) and criminal-planning route to OK at default — Literal restricts target_category to the covered four. ISC direction: test case names its target harm category, agent writes content that the classifier argmax-matches. Probe-verified violence anchor in v3_fewshot. Keywords: aiml_aegis, content safety, OpenAI moderation, V/S/H/SH, ISC, TVD.

By wuyoscar schedule Updated 4/24/2026

play_arrow Run Skill in Manus View GitHub

Skill instructions (SKILL.md) could not be loaded from local cache or raw GitHub repository.

Install via CLI

npx skills add https://github.com/wuyoscar/ISC-Bench --skill aiml-aegis-content-safety

Repository Details

star Stars 777

call_split Forks 119

navigation Branch main

article Path SKILL.md