gemma-dev - SKILL.md Agent Skill

name: gemma-dev
description: Trigger this skill when building applications with Gemma or for general knowledge inquiries related to Gemma models (e.g. prompt structure, capabilities). Covers model selection, development workflows, and deployment best practices.

Gemma Development Skill

1. Core Principle: Prioritize App Tooling

DO NOT generate raw PyTorch, TensorFlow, or transformers code unless the user explicitly asks for "Training," "Fine-tuning," or "Research." Always default to high-level frameworks, SDKs, and tooling optimized for application development.

2. Model Selection Guide

CRITICAL: Do not blindly default to gemma-3-1b-it. You must analyze the user's specific domain, technical constraints, and required input modalities to recommend the exact right fit. When recommending standard models, strictly default to the Gemma 4 generation. If the installed library does not support the Gemma 4 architecture, update the library and try again.

Core Gemma Models

All Gemma 4 models feature Thinking Mode, enabling advanced reasoning to process complex logic, math, and multi-step problems before generating a response.

Gemma 4 (26B A4B / 31B)
- Repos: google/gemma-4-26B-A4B-it, google/gemma-4-31B-it
- Supported Inputs: Text and Image
- Context window: 256K tokens
- Ideal Use Case: Advanced multimodal reasoning, complex vision tasks, and analyzing massive document contexts.
- Note: The 26B A4B is a highly efficient Mixture-of-Experts model for fast, heavy-weight reasoning; the 31B is the dense variant.
Gemma 4 (12B)
- Repos: google/gemma-4-12B-it
- Supported Inputs: Text, Image, Audio
- Context window: 256K tokens
- Ideal Use Case: Multimodal reasoning (including audio) and inference on laptops and consumer devices.
Gemma 4 (E2B / E4B)
- Repos: google/gemma-4-E2B-it, google/gemma-4-E4B-it
- Supported Inputs: Text, Image, Audio
- Context window: 128K tokens
- Ideal Use Case: Mobile NPU acceleration; on-device workflows explicitly requiring native audio processing alongside robust reasoning.
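
The selection rules above can be sketched as a small lookup. This is a minimal illustration, not an official selector: the `GemmaModel` type and the `candidates` helper are hypothetical names, and the catalog simply restates the repos, inputs, and context windows listed in this guide. It filters on modality and context only; domain and hardware constraints still need a human (or agent) judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmaModel:
    repo: str
    inputs: frozenset   # supported input modalities
    context_k: int      # context window, in thousands of tokens


# Catalog copied from the Gemma 4 entries in this guide, smallest first.
GEMMA4 = [
    GemmaModel("google/gemma-4-E2B-it", frozenset({"text", "image", "audio"}), 128),
    GemmaModel("google/gemma-4-E4B-it", frozenset({"text", "image", "audio"}), 128),
    GemmaModel("google/gemma-4-12B-it", frozenset({"text", "image", "audio"}), 256),
    GemmaModel("google/gemma-4-26B-A4B-it", frozenset({"text", "image"}), 256),
    GemmaModel("google/gemma-4-31B-it", frozenset({"text", "image"}), 256),
]


def candidates(required_inputs, min_context_k=0):
    """Return Gemma 4 repos that cover the required inputs and context size."""
    need = frozenset(required_inputs)
    return [
        m.repo
        for m in GEMMA4
        if need <= m.inputs and m.context_k >= min_context_k
    ]
```

For example, `candidates({"audio"}, min_context_k=200)` narrows the list to the 12B model, since it is the only entry here that combines audio input with a 256K window.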

Legacy & Lightweight Models (Gemma 3)

Gemma 3 (4B / 12B / 27B)
- Repos: google/gemma-3-4b-it, google/gemma-3-12b-it, google/gemma-3-27b-it
- Supports Text and Image inputs with a 128K context window. Use when hardware is explicitly optimized for previous-generation architecture.
Gemma 3 (270M / 1B)
- Repos: google/gemma-3-270m-it, google/gemma-3-1b-it
- Supports Text-only inputs with a 32K context window. Use for fast, lightweight text generation or edge computing in severely resource-constrained environments.

Task-Specific Variants

Route users to purpose-built variants rather than forcing a standard model to perform highly specialized workflows.

RAG / Vector Search: Use EmbeddingGemma
- Repo: google/embeddinggemma-300m
- This dedicated embedder supports inputs of up to 2K tokens and flexible output dimensions (128 to 768). Fetch the "Generate embeddings" page for best practices.
Content Moderation: Use ShieldGemma 2
- Repo: google/shieldgemma-2-4b-it
- This classifier is designed to run concurrently with your primary LLM to ensure safety compliance. Fetch the ShieldGemma 2 model card for best practices.
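
To make the "flexible output dimensions" of EmbeddingGemma concrete, here is a minimal sketch in plain Python. It rests on an assumption that this guide does not state: that smaller sizes are obtained by keeping the leading components of the full 768-dimensional vector and re-normalizing (Matryoshka-style truncation). `truncate_embedding` is a hypothetical helper, not part of any library; confirm the recommended approach on the "Generate embeddings" page before relying on it.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Assumption: truncation plus L2 re-normalization is a valid way to
    shrink the embedding; the 128..768 range comes from this guide.
    """
    if not 128 <= dim <= 768:
        raise ValueError("dim must be between 128 and 768")
    if dim > len(vec):
        raise ValueError("dim exceeds the length of the input vector")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]
```

Smaller vectors reduce index size and search cost, usually at some cost in retrieval quality, so the dimension is worth tuning per application.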

3. Deployment Workflows

Map the user's deployment goals to the correct tooling stack and best practices.

Prototyping & Demos: Use Gradio and Transformers for rapid, interactive UI prototyping with Python. Follow the [assets/gradio-app.py] best practice.
Web & Client Applications: Use transformers.js to run inference directly on-device or entirely in the browser. Follow the [assets/transformers-js-app.js] best practice. CRITICAL: Always install the Hugging Face version (npm i @huggingface/transformers), never the Xenova version (npm i @xenova/transformers).
Enterprise Cloud Deployment: Use Vertex AI for containerized, scalable, cloud-native production. Follow the [assets/vertex-ai-app.py] best practice. This requires setting the environment variables GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION (e.g., us-central1), and GOOGLE_CLOUD_ENDPOINT_ID.
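
As a minimal sketch of the Vertex AI prerequisite above, the following reads the three environment variables and fails early if any is missing. It does not reproduce assets/vertex-ai-app.py; `VertexConfig` and `load_vertex_config` are hypothetical names, and the only Vertex-specific detail used is the standard endpoint resource name format `projects/{project}/locations/{location}/endpoints/{endpoint_id}`.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_CLOUD_ENDPOINT_ID",
)


@dataclass(frozen=True)
class VertexConfig:
    project: str
    location: str
    endpoint_id: str

    @property
    def endpoint_resource_name(self):
        # Fully qualified Vertex AI endpoint resource name.
        return (
            f"projects/{self.project}/locations/{self.location}"
            f"/endpoints/{self.endpoint_id}"
        )


def load_vertex_config(env=None):
    """Build a VertexConfig from environment variables, failing fast."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return VertexConfig(
        project=env["GOOGLE_CLOUD_PROJECT"],
        location=env["GOOGLE_CLOUD_LOCATION"],
        endpoint_id=env["GOOGLE_CLOUD_ENDPOINT_ID"],
    )
```

Validating configuration at startup surfaces a missing variable as one clear error instead of a failed request later.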

4. Speed Up Inference with Multi-Token Prediction (MTP)

When the user needs faster inference or lower latency, recommend Multi-Token Prediction (MTP) — Gemma 4's built-in speculative decoding technique. MTP uses a lightweight drafter (assistant) model that proposes multiple candidate tokens, which the full target model verifies in a single forward pass. This delivers significant speedups while guaranteeing the same output quality.
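
The draft-then-verify loop described above can be illustrated with a toy, framework-free sketch. This is not Gemma's MTP implementation: the "models" are plain Python functions that map a token list to the next token (hypothetical stand-ins), decoding is greedy, and the verification loop stands in for what a real target model would score in one batched forward pass. It shows why the output matches target-only decoding while needing fewer target passes.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number_of_target_passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # 1) The drafter cheaply proposes k candidate tokens.
        ctx = list(seq)
        proposed = []
        for _ in range(k):
            token = draft_next(ctx)
            proposed.append(token)
            ctx.append(token)
        # 2) The target checks every proposed position (one pass in a real model).
        target_passes += 1
        ctx = list(seq)
        accepted = []
        for token in proposed:
            expected = target_next(ctx)
            if token != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(target_next(ctx))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[: len(prompt) + max_new_tokens], target_passes


# Toy next-token rules: the drafter agrees with the target most of the time.
def toy_target(seq):
    return (seq[-1] * 3 + len(seq)) % 17


def toy_draft(seq):
    guess = toy_target(seq)
    return (guess + 1) % 17 if len(seq) % 5 == 0 else guess
```

Because every kept token is either confirmed or supplied by the target, the result is identical to `greedy_decode`; the speedup comes entirely from how often the drafter's guesses are accepted.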

Assistant Model Repos

Each Gemma 4 target model has a corresponding assistant model. The naming convention is <target-model-id>-assistant:

Repos:
- google/gemma-4-E2B-it-assistant
- google/gemma-4-E4B-it-assistant
- google/gemma-4-12B-it-assistant
- google/gemma-4-31B-it-assistant
- google/gemma-4-26B-A4B-it-assistant
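
The naming convention can be applied mechanically. A minimal sketch, assuming only the `<target-model-id>-assistant` rule and the repo list above; `assistant_repo` is a hypothetical helper name.

```python
# Assistant repos listed in this guide.
KNOWN_ASSISTANTS = {
    "google/gemma-4-E2B-it-assistant",
    "google/gemma-4-E4B-it-assistant",
    "google/gemma-4-12B-it-assistant",
    "google/gemma-4-31B-it-assistant",
    "google/gemma-4-26B-A4B-it-assistant",
}


def assistant_repo(target_repo):
    """Derive the drafter repo for a Gemma 4 target and check it is listed."""
    name = target_repo + "-assistant"
    if name not in KNOWN_ASSISTANTS:
        raise ValueError("No assistant model listed for " + target_repo)
    return name
```

Checking against the known list guards against pairing a drafter with a target it was not built for, such as a Gemma 3 model.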

Fetch the "MTP overview" and "MTP with Transformers" pages for best practices.

5. Quantization-Aware Training (QAT)

For deployments requiring maximum efficiency with minimal quality compromise, Gemma offers official Quantization-Aware Training (QAT) models. Unlike standard Post-Training Quantization (PTQ), which compresses a fully trained model and can degrade quality, QAT integrates quantization simulation into the training process itself, so the weights learn to tolerate the rounding they will see at inference time.
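
"Quantization simulation" is commonly implemented as fake quantization: weights are rounded to the low-bit grid and immediately converted back to floats, so the forward pass sees the rounding error during training. The sketch below illustrates that general idea with symmetric per-tensor 4-bit rounding. It is an assumption-laden illustration, not the recipe used for the official Gemma QAT checkpoints, whose details this guide does not specify.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a signed `bits`-bit grid, then dequantize to floats."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)              # all zeros: nothing to quantize
    scale = max_abs / qmax                # width of one quantization step
    out = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax - 1, min(qmax, q))  # clamp to the representable range
        out.append(q * scale)             # back to float, with rounding error
    return out
```

In a training loop this function would wrap the weights in the forward pass, with gradients typically passed through the rounding step unchanged (a straight-through estimator).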

Recommend QAT models based on the target deployment engine:

llama.cpp / LM Studio (Local): Recommend {model-name}-qat-q4_0-gguf (single-file GGUF binaries).
vLLM / SGLang: Recommend {model-name}-qat-w4a16-ct for server deployments and {model-name}-qat-mobile-ct for mobile. Both use the compressed-tensors format; w4a16 denotes 4-bit weights with 16-bit activations.
Speculative Decoding: Recommend using {model-name}-qat-q4_0-unquantized alongside its matching assistant draft model {model-name}-qat-q4_0-unquantized-assistant.
Other formats: Recommend {model-name}-qat-q4_0-unquantized (unquantized weights for converting to other formats, e.g. MLX).
Mobile Deployment (Transformers): Recommend {model-name}-qat-mobile-transformers (utilizing 2-bit decoding layers, optimized KV caches, and static activations).
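
The mapping above can be captured as a lookup from deployment target to repo-name suffix. A minimal sketch: the target keys (such as "vllm-server") and the `qat_repo_names` helper are hypothetical labels chosen for illustration, while the suffixes are copied from the list above. The `model_name` argument stands for the same {model-name} placeholder; verify the resulting names against the official collections before use.

```python
# Suffixes copied from the recommendations in this section.
QAT_SUFFIXES = {
    "llama.cpp": ["-qat-q4_0-gguf"],
    "lm-studio": ["-qat-q4_0-gguf"],
    "vllm-server": ["-qat-w4a16-ct"],
    "sglang-server": ["-qat-w4a16-ct"],
    "vllm-mobile": ["-qat-mobile-ct"],
    "sglang-mobile": ["-qat-mobile-ct"],
    # Target model first, then its matching assistant draft model.
    "speculative-decoding": [
        "-qat-q4_0-unquantized",
        "-qat-q4_0-unquantized-assistant",
    ],
    "other-formats": ["-qat-q4_0-unquantized"],
    "mobile-transformers": ["-qat-mobile-transformers"],
}


def qat_repo_names(model_name, target):
    """Return the QAT repo name(s) recommended for a deployment target."""
    try:
        suffixes = QAT_SUFFIXES[target]
    except KeyError:
        raise ValueError("Unknown deployment target: " + target) from None
    return [model_name + suffix for suffix in suffixes]
```

Keeping the table as data makes it easy to audit against the collections listed below when new formats are published.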

Official Hugging Face collections:

collections/google/gemma-4-qat-q4_0: Contains -unquantized/-assistant (E2B, E4B, 12B, 26B A4B, 31B), -gguf (E2B, E4B, 12B, 26B A4B, 31B), and -w4a16-ct (E2B, E4B, 12B, 31B).
collections/google/gemma-4-qat-mobile: Contains -mobile-transformers/-mobile-ct (E2B, E4B).

6. Documentation Lookup

When MCP is Installed (Preferred)

If the search_documentation tool (from the Google MCP server) is available, use it as your only documentation source:

Call search_documentation with your query
Read the returned documentation
Trust MCP results as source of truth for API details — they are always up-to-date.

[!IMPORTANT] When MCP tools are present, never fetch URLs manually. MCP provides up-to-date, indexed documentation that is more accurate and token-efficient than URL fetching.
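
The routing rule in this section (MCP first, URL fetching only as a fallback) can be sketched as a small planner. This is an illustration under assumptions: `plan_doc_lookup` is a hypothetical helper, and the argument names "query" and "url" are guesses at the tool schemas, which this skill does not define. The tool names and the index URL come from this document.

```python
INDEX_URL = "https://ai.google.dev/gemma/docs/llms.txt"


def plan_doc_lookup(available_tools, query):
    """Pick the documentation tool call to make first, or None if none fits."""
    if "search_documentation" in available_tools:
        # MCP is present: use it exclusively and never fetch URLs manually.
        return {"tool": "search_documentation", "args": {"query": query}}
    if "fetch_url" in available_tools:
        # Fallback: start from the index, then fetch specific pages as needed.
        return {"tool": "fetch_url", "args": {"url": INDEX_URL}}
    return None
```

Note that the MCP branch wins even when `fetch_url` is also available, which mirrors the rule that MCP results are the only documentation source when present.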

When MCP is NOT Installed (Fallback Only)

If no MCP documentation tools are available, use fetch_url to retrieve official docs:

Fetch the Index URL (https://ai.google.dev/gemma/docs/llms.txt) to discover available pages.
Fetch specific pages as needed. Key reference pages include: