ai-native-development - SKILL.md Agent Skill

name: ai-native-development description: "Use when reasoning about agent autonomy levels, designing auto-improve loops, evaluating AI-generated code quality, or measuring agent productivity in an LLM-assisted codebase. Covers Karpathy's three eras of software (1.0 explicit / 2.0 learned / 3.0 natural-language), the vibe-coding-vs-agentic-engineering distinction, the 0–5 autonomy slider with task-type recommendations, the one-asset / one-metric / one-time-box AutoResearch loop, Software 3.0 productivity metrics, and the documented quality regressions of ungated AI-generated code (the 'vibe hangover'). Do NOT use for choosing a specific autonomy-loop topology (use `agent-engineering`), for the per-prompt authoring discipline (use `prompt-craft`), or for reviewing the AI-generated code that comes out of a Software 3.0 workflow (use `code-review`). Do NOT use for improve this specific prompt for the grader. Do NOT use for review this AI-generated PR for correctness. Do NOT use for design the checkpoint state machine for our loop." license: MIT compatibility: "Provider- and runtime-agnostic. The autonomy-slider levels and quality-gate sequence apply to any LLM-coding harness (Claude Code, OpenCode, Cursor, Aider, Copilot Workspace, Continue) that supports a deterministic verify step between agent output and merge." allowed-tools: Read Grep metadata: relations: "{"related":["prompt-craft","skill-router","tool-call-strategy","agent-engineering","code-review"],"verify_with":["code-review","testing-strategy"]}" subject: agent-ops scope: "Reasoning about agent autonomy levels, auto-improve loops, AI-generated code quality, and productivity in LLM-assisted codebases — Karpathy's three eras (1.0 explicit / 2.0 learned / 3.0 natural-language), the vibe-coding-vs-agentic-engineering distinction, the 0–5 autonomy slider with task-type recommendations, the one-asset/one-metric/one-time-box AutoResearch loop, Software 3.0 productivity metrics, and the documented regressions of ungated AI-generated code (the 'vibe hangover'). Portable across any LLM-assisted codebase; principle-grounded, not repo-bound. Excludes choosing a specific autonomy-loop topology (agent-engineering), per-prompt authoring discipline (prompt-craft), and reviewing the AI-generated code itself (code-review)." public: "true" taxonomy_domain: agent/concepts stability: experimental keywords: "["software 3.0 concepts","vibe coding","agentic engineering doctrine","autonomy slider","prompt as code","karpathy three eras","autoresearch loop","ai-generated code quality","vibe hangover","llm-native development"]" examples: "["we keep accepting agent-generated code on first try and shipping bugs — what discipline replaces this?","what autonomy level should I run for a security-sensitive change?","does measuring lines-of-code per session make sense when an agent generates the code?","the team is treating prompts and skill files like throwaway notes — what's the alternative framing?","we want an auto-improve loop for our skill content — how do we constrain it so it doesn't regress?","what's the conceptual difference between a vibe coding session and an agentic engineering session?","AI-generated code is shipping with vulnerabilities — what gates should sit between agent output and production?","how do I match autonomy level to the risk profile of the task?"]" anti_examples: "["improve this specific prompt for the grader","review this AI-generated PR for correctness","design the checkpoint state machine for our loop","scaffold a new skill that codifies our coding doctrine","the autonomous loop is stalling — debug it"]" grounding: "{"subject_matter":"AI-native software development discipline for prompt-as-code workflows, agent autonomy calibration, metric-gated auto-improvement loops, and quality gates for AI-generated code","grounding_mode":"hybrid","truth_sources":["https://www.youtube.com/watch?v=LCEmiRjPEtQ\",\"https://github.com/karpathy/autoresearch\",\"https://arxiv.org/abs/2211.03622\",\"https://arxiv.org/abs/2504.20814\",\"https://snyk.io/lp/secure-adoption-in-the-genai-era/\",\"https://owasp.org/www-project-top-10-for-large-language-model-applications/\"],\"failure_modes\":[\"unintentional_high_autonomy_for_high_risk_work\",\"accepting_ai_generated_code_without_review_or_tests\",\"treating_prompts_and_skills_as_throwaway_notes\",\"optimizing_agent_loops_against_multiple_moving_metrics\",\"citing_stale_ai_code_security_statistics_as_fixed_truth\",\"shipping_agentic_systems_without_prompt_injection_or_excessive_agency_controls\"],\"evidence_priority\":\"equal\"}" skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph" skill_graph_project: Skill Graph skill_graph_canonical_skill: skills/agent-ops/ai-native-development/SKILL.md skill_graph_export_description_projection: anti_examples skill_graph_export_description_projection_truncated: "true"

AI-Native Development

Concept of the skill

Reasoning about agent autonomy levels, auto-improve loops, AI-generated code quality, and productivity in LLM-assisted codebases — Karpathy's three eras (1.0 explicit / 2.0 learned / 3.0 natural-language), the vibe-coding-vs-agentic-engineering distinction, the 0–5 autonomy slider with task-type recommendations, the one-asset/one-metric/one-time-box AutoResearch loop, Software 3.0 productivity metrics, and the documented regressions of ungated AI-generated code (the 'vibe hangover').

Coverage

The conceptual model for software development when an LLM participates in code creation. Specifically: Andrej Karpathy's three eras of software (1.0 explicit code / 2.0 learned weights / 3.0 natural-language programs); the vibe-coding-vs-agentic-engineering distinction and when each is appropriate; the 0–5 autonomy slider mapping task type and risk to the right level of agent independence; the AutoResearch improvement loop with its three constraints (one editable asset, one scalar metric, one time box); Software 3.0 productivity metrics that replace lines-of-code and commit-count for an LLM-assisted team; the documented security and quality regressions of ungated AI-generated code (the "vibe hangover") and the quality-gate sequence that compensates for them; and the operating principle that prompts, skill files, and agent-runtime configuration are source code — versioned, reviewed, tested.

Philosophy of the skill

A prompt is a program. A skill file is a library. An agent session is a runtime. This is not a metaphor; it is the literal operational model of an LLM-assisted codebase. The mistake teams make is treating these artifacts as ad-hoc notes — the same mistake early industry made with shell scripts before treating them as version-controlled software. AI-native development is the discipline of putting the same engineering rigor around prompts and skills that any team puts around production code: source control, code review, tests, contracts, observability.

The largest single failure mode at the team level is unintentional autonomy. Without an explicit framing, every agent session defaults to the highest autonomy the harness allows, regardless of the task's risk. Vibe coding is not wrong — for a throwaway prototype it is correct. It is wrong as the default for production code. The autonomy slider is the framing tool that lets a team decide intentionally where on the slider any given task should run, and what gates compensate when autonomy goes up.

1. The Three Eras of Software

Karpathy named a structural shift in how programs are produced:

Software 1.0 — Explicit code

Humans write instructions in a programming language. A compiler or interpreter executes them. Behavior is deterministic and fully auditable. Bugs are logic errors in code humans wrote.

Human writes code → compiler/interpreter runs code → output

Software 2.0 — Learned programs

Humans curate data and pick an architecture. An optimizer trains weights. The trained network is the program. Behavior is probabilistic; auditability is partial (interpretability is an open problem). Bugs are distribution mismatches or training artifacts.

Human curates data + defines architecture → training → weights (the program) → output

Software 3.0 — Natural-language programs

Humans write a specification in natural language. An LLM interprets the specification and produces behavior. The "code" is the prompt. Behavior is stochastic — the same prompt can produce different output across runs. Bugs are ambiguities in the prompt or gaps in the model's knowledge.

Human writes prompt → LLM interprets prompt → output

The mapping

In Software 3.0 the prompt is the program. Every traditional software-engineering concept has an analogue:

Traditional concept	Software 3.0 equivalent
Source code	Prompt files, system prompts, skill specifications
Libraries	Reusable skill files
Compiler	LLM inference engine
Linker	Skill-injector / context-loader
Runtime	Agent session
RAM	Context window
Debugger	Context-failure analysis
Tests	Eval suites
Version control	Skill / prompt versioning + git

Once the mapping is explicit, the engineering disciplines transfer: review the prompt the way you'd review a function; version the skill the way you'd version a library; eval the agent's output the way you'd run unit tests against a build.

2. Vibe Coding vs Agentic Engineering

"Vibe coding" was named by Karpathy (Feb 2025) for the practice of generating code by feel — describing what you want, accepting the output, iterating by vibes. It is the default mode of most AI-assisted development. Agentic engineering is the disciplined alternative: structured, verifiable, with quality gates at every step.

Dimension	Vibe coding	Agentic engineering
Planning	None — "just start coding"	Explicit plan or task spec
Specification	Verbal / mental model	Written contracts (acceptance criteria, ADRs, skill files)
Code generation	Accept first output	Generate → verify → iterate
Review	Skim the diff	Automated gates (lint, type-check, tests) + human spot-check
Quality	"Does it look right?"	Measurable criteria (evals pass, CI green)
Knowledge	Lost between sessions	Captured (skills, memory, ADRs, decision records)
Reproducibility	Low — depends on prompt phrasing	High — same skill content produces same behavior
Security	"It probably works"	Explicit security review; threat model considered
Scale	Fits small prototypes	Fits production systems with multiple agents

When vibe coding is the right tool

Vibe coding is correct for: throwaway prototypes, personal scripts with no users, learning a new library by playing with it, design exploration before committing to an approach.

When vibe coding is the wrong tool

Vibe coding is wrong for: production code, financial calculations, security-sensitive logic (auth, authorization, data handling), shared codebases where other developers or agents will maintain the code.

3. The Autonomy Slider

Agent autonomy exists on a spectrum, not a binary. The right level depends on three inputs: task type, quality of available context, and consequences of failure.

Levels

Level	Name	Human role	Agent role	Example
0	Manual	Writes all code	None	Traditional development
1	Suggestion	Reviews suggestions, accepts/rejects	Suggests completions	Inline tab-completion
2	Drafting	Reviews drafts, edits before commit	Generates complete drafts from prompts	"Write a component that does X"
3	Implementing	Reviews finished work, runs gates	Implements full features, writes tests	Agent completes one ticket end-to-end
4	Autonomous + spot-check	Spot-checks via session summary	Implements, tests, documents, commits	Multi-task queue worked independently
5	Fully autonomous	Monitors metrics, intervenes on anomalies	Prioritize, implement, verify, deploy	Theoretical — not yet safe for production

Autonomy by task type

Task type	Recommended level	Why
Bug fix with failing test	4	Clear acceptance criteria; low ambiguity
New feature implementation	3	Architectural decisions need human review
Codebase audit	4	Research; agent investigates autonomously
Security-sensitive change	2	High consequences; human must verify
Financial calculation logic	2–3	Monetary consequences; careful review needed
Documentation update	4	Low risk; agent verifies against source
UI / visual implementation	3	Visual judgment; screenshot review required
Refactor with green tests	3–4	Tests guard correctness; scope review still needed
Production deployment	1	High consequences; human controls the process

Autonomy prerequisites

Higher autonomy requires better infrastructure. The slider is not a free parameter; moving it up requires the supporting controls.

Level	Required infrastructure
2	Prompt quality, basic type checking
3	Automated tests, CI pipeline, skill system, code review
4	Tripwire guardrails on destructive operations, structured session-summary protocol, persistent memory, eval suite, model routing
5	Self-healing, anomaly detection, automatic rollback, comprehensive evals, runtime observability

A team that runs at level 4 without the level-4 infrastructure is not "moving fast"; it is shipping at level 4 with level-2 safety, and the gap will surface as production incidents.

4. The AutoResearch Loop

Karpathy's autoresearch pattern is the simplest reliable shape for autonomous agent improvement work:

LOOP:
  1. Modify one thing (code, config, parameter, prompt)
  2. Run the experiment (execute, measure)
  3. Check: did the metric improve?
       YES → keep the change, continue
       NO  → revert the change, try something else
  4. Repeat until the time box expires

The three constraints

The loop works because of what it forbids, not what it allows:

One editable asset — agent can modify one file, one function, or one parameter set. No cascading multi-file edits in a single iteration. Prevents diffuse failure.
One scalar metric — success is one number (accuracy, latency, score, cost). Prevents the "improved on A but regressed B and we didn't notice" failure.
One time box — the loop runs for a fixed duration. Prevents infinite exploration on a problem the loop won't solve.

When to use AutoResearch vs manual iteration

Situation	AutoResearch	Manual
Optimizing a measurable metric	Yes	No
Exploring design alternatives with a judge	Yes (with judge as the metric)	Also fine
Implementing a specified feature	No	Yes
Debugging a specific bug with known root cause	No	Yes
Tuning a prompt against an eval set	Yes	Also fine
Performance optimization with a clear metric	Yes	If the metric is composite

Common failure modes

Multi-axis edits. The agent edits two files in one iteration; the metric improves; you don't know which edit caused it. Solution: enforce one-asset programmatically.
Metric drift. The metric the loop optimizes is not the metric you actually care about. Solution: validate the metric on a held-out set before starting the loop.
No time box. The loop runs indefinitely against a problem the loop can't solve. Solution: hard limit; manual review at expiry.

5. Software 3.0 Productivity Metrics

Traditional software metrics — lines of code, commits per day, velocity points — are meaningless when an agent can produce 10,000 lines in five minutes. The question is not how much was produced; it is whether what was produced was correct.

Metrics that matter

Metric	What it measures	Direction
Tasks completed per session	Throughput	Higher better
Agent completion rate	Autonomy quality	Higher better — % of tasks finished without human intervention
Rework rate	Output quality	Lower better — % of tasks that needed human correction
Time-to-value	Idea → working feature	Decreasing trend
Skill / context-injection accuracy	Context engineering health	Higher precision and recall
Eval pass rate	Skill-content correctness	Higher better
Context-failure rate	Agent reliability	Lower better — % of tasks where agent went wrong because of bad context

Metrics that don't matter

Metric	Why it's misleading
Lines of code generated	Quantity is free; quality isn't
Commits per day	More commits ≠ more value
Files changed	Breadth says nothing about correctness
Time spent coding	The constraint is not coding time; it is human attention
Number of agents running	More agents can mean more noise, not throughput

The productivity equation

productivity ≈ (tasks completed × quality score) / human attention consumed

The goal is to grow the numerator while shrinking the denominator. "Getting better at AI-native development" is the operational definition of moving this ratio in the right direction over time.

6. The Vibe Hangover — Quality Gates as Compensation

The rapid adoption of AI-assisted coding has produced enough security and quality evidence to make ungated acceptance irresponsible. The exact numbers move across models, tasks, prompts, and study designs, so treat the evidence as a risk signal rather than a permanent multiplier.

Current evidence shape

Source	Stable takeaway
Perry, Srivastava, Kumar, and Boneh, "Do Users Write More Insecure Code with AI Assistants?"	In a controlled security-task study, Codex-assisted participants wrote less secure code and were more likely to believe insecure answers were secure.
"Secure Coding with AI -- From Detection to Repair"	Real-world GPT-generated snippets contained vulnerabilities, but newer models improved at detecting and repairing issues; the correct posture is review-and-repair, not blanket rejection or blind trust.
Snyk secure adoption survey	Organizations are optimistic about AI coding tools while security practitioners report more concern and many teams skip basic preparation such as PoCs and developer training.
OWASP Top 10 for LLM Applications	AI-native systems add security classes such as prompt injection, sensitive information disclosure, supply-chain risk, data/model poisoning, improper output handling, and excessive agency.

These numbers will move; the structural reason will not. AI-generated code has more vulnerabilities because:

Training-data bias. Models learn from public repos that include vulnerable code; popular patterns are not necessarily secure patterns.
Missing context. The model does not know the deployment environment, threat model, or compliance requirements unless explicitly told.
Acceptance bias. Developers scrutinize AI-generated code less than code they wrote themselves ("it looks reasonable").
Speed-vs-security trade-off. Faster output encourages faster acceptance. Security review is slow and feels like friction.
Agentic attack surface. Tool use, repository access, memory, and external context can turn prompt injection or excessive agency into real code, data, or infrastructure impact.

The compensating gates

The defence is mandatory verification between every agent action and production:

Agent generates code
      │
      ▼
[Gate 1] Type checking
      │
      ▼
[Gate 2] Lint / style / safety rules
      │
      ▼
[Gate 3] Automated tests (unit + integration)
      │
      ▼
[Gate 4] Security scanning (deps, secrets, known CVEs)
      │
      ▼
[Gate 5] Design / visual review (for UI changes)
      │
      ▼
[Gate 6] Human spot-check proportional to risk
      │
      ▼
Production

Rule: no gate may be skipped under speed pressure. An agent that passes all gates is trustworthy on this change. An agent that bypasses gates is a liability regardless of how good the diff looks.

The gates are also the justification for higher autonomy. Without the gates, level 4 is reckless. With the gates, level 4 can be responsible for bounded work. Level 5 remains theoretical for production systems unless the organization has automatic rollback, runtime observability, security controls for agent tools, and an explicit human escalation path.

7. Operating Position

A team can name its current operating point on the autonomy slider explicitly. Most production-LLM teams sit between level 3 and level 4: agents implement complete features end-to-end, run quality gates locally, and the human reviews the completed work via a structured session summary or PR rather than line-by-line as it is being written.

Moving up the slider over time is a deliberate engineering project: each step requires the supporting infrastructure to move with it (eval suites, tripwire guardrails on destructive operations, persistent memory across sessions, model-routing logic for matching tasks to model strengths). A team that drifts upward without that infrastructure is drifting toward a regression event, not toward higher productivity.

Moving down the slider is also legitimate: high-stakes work (production deployment, security-sensitive logic, irreversible data operations) should run at lower autonomy regardless of the team's overall position. The slider is a per-task setting, not a team-wide setting.

Verification

Prompts and skill specifications are treated as source code — versioned in git, reviewed before merge, covered by evals where useful
Every agent session operates at an intentional autonomy level chosen for the task's risk, not the harness's default
Quality gates exist between agent output and production: type check, lint, automated tests, security scan, plus human review proportional to task risk
LLM-application risks such as prompt injection, sensitive information disclosure, supply-chain risk, and excessive agency are considered when agents can read, write, or call tools
Productivity is measured by outcomes (tasks completed, rework rate, time-to-value), not by output volume (LoC, commit count)
Knowledge is captured durably (skill files, decision records, structured session summaries) rather than lost between sessions
Auto-improve loops are constrained per the AutoResearch pattern — one editable asset, one scalar metric, one time box
Security regressions known to come from AI-generated code (data exposure, weak auth, accepted-but-vulnerable patterns) are explicitly mitigated by the gate stack
Vibe-coding patterns are limited to throwaway prototypes; production work runs as agentic engineering
The team can answer "what is our current autonomy level on this task?" without ambiguity, and the answer is justified by the task's risk profile

Do NOT Use When

Use instead	When
`prompt-craft`	Authoring or improving a specific prompt — the per-prompt discipline below this skill's conceptual frame
`agent-engineering`	Designing the production-reliability layer for an agent system: orchestration patterns, error budgets, observability, fault tolerance
`code-review`	Reviewing the AI-generated code that comes out of a Software 3.0 workflow — this skill frames why the review is needed; code-review is how the review is done
`tool-call-strategy`	The tactical layer of which tool an agent should call when, in what order, with what fallback
`skill-router`	The cross-skill dispatch decision (which skill activates for a query) — this skill is meta about why a skill library exists at all
`debugging`	An autonomous loop has stalled, regressed, or is producing wrong output and you need to chase the root cause

Skill Graph context

Classification

Subject: agent-ops
Public: true
Domain: agent/concepts
Scope: Reasoning about agent autonomy levels, auto-improve loops, AI-generated code quality, and productivity in LLM-assisted codebases — Karpathy's three eras (1.0 explicit / 2.0 learned / 3.0 natural-language), the vibe-coding-vs-agentic-engineering distinction, the 0–5 autonomy slider with task-type recommendations, the one-asset/one-metric/one-time-box AutoResearch loop, Software 3.0 productivity metrics, and the documented regressions of ungated AI-generated code (the 'vibe hangover'). Portable across any LLM-assisted codebase; principle-grounded, not repo-bound. Excludes choosing a specific autonomy-loop topology (agent-engineering), per-prompt authoring discipline (prompt-craft), and reviewing the AI-generated code itself (code-review).

When to use

we keep accepting agent-generated code on first try and shipping bugs — what discipline replaces this?
what autonomy level should I run for a security-sensitive change?
does measuring lines-of-code per session make sense when an agent generates the code?
the team is treating prompts and skill files like throwaway notes — what's the alternative framing?
we want an auto-improve loop for our skill content — how do we constrain it so it doesn't regress?
what's the conceptual difference between a vibe coding session and an agentic engineering session?
AI-generated code is shipping with vulnerabilities — what gates should sit between agent output and production?
how do I match autonomy level to the risk profile of the task?

Not for

improve this specific prompt for the grader
review this AI-generated PR for correctness
design the checkpoint state machine for our loop
scaffold a new skill that codifies our coding doctrine
the autonomous loop is stalling — debug it

Related skills

Verify with: code-review, testing-strategy
Related: prompt-craft, skill-router, tool-call-strategy, agent-engineering, code-review

Grounding

Mode: hybrid
Truth sources: https://www.youtube.com/watch?v=LCEmiRjPEtQ, https://github.com/karpathy/autoresearch, https://arxiv.org/abs/2211.03622, https://arxiv.org/abs/2504.20814, https://snyk.io/lp/secure-adoption-in-the-genai-era/, https://owasp.org/www-project-top-10-for-large-language-model-applications/

Keywords

software 3.0 concepts, vibe coding, agentic engineering doctrine, autonomy slider, prompt as code, karpathy three eras, autoresearch loop, ai-generated code quality, vibe hangover, llm-native development