gemma4-local-deploy - SKILL.md Agent Skill

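name: gemma4-local-deploy
description: Deploy Gemma 4 12B locally on a Mac or Apple Silicon machine. Install or upgrade llama.cpp locally, download a GGUF quantized model, and expose an OpenAI-compatible API with llama-server, or expose a local model service with Ollama. Choose among the default Q4_K_M, 64K/128K long context, QAT Q4_0 @ 256K, and a side-by-side comparison demo according to the user's needs; configure tmux for background running; verify the health check, chat endpoint, resource usage, and common failures. Use when the user mentions deploying Gemma 4, Gemma 4 12B, a local LLM, long context, QAT, quantization, llama-server, Ollama, GGUF, or a local model service on Mac.
allowed-tools: Bash, Read, WebSearch, WebFetch
metadata:
  argument-hint: "[model quantization / port / run in background or not]"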

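Gemma 4 12B Local Deployment

Goal: deploy the GGUF build of Gemma 4 12B as a local model service. By default, use llama.cpp / llama-server with Apple Metal, Q4_K_M, and tmux to expose an OpenAI-compatible API. Switch to the QAT Q4_0 profile when the user explicitly asks for QAT, 256K, or a comparison demo, and take the Ollama import path only when the user explicitly asks for Ollama.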

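Defaults

Default model repo: ggml-org/gemma-4-12B-it-GGUF
Default quantization: Q4_K_M
Default model name: gemma-4-12b-it
Default host and port: 127.0.0.1:8080
Default context: 32768
12B long context: when the user explicitly asks for more context, raise it to 65536 or the native maximum of 131072
QAT repo: google/gemma-4-12B-it-qat-q4_0-gguf
QAT quantization: Q4_0; the file is usually named gemma-4-12b-it-qat-q4_0.gguf
QAT context: use 262144 when the user asks for QAT, maximum context, or 256K
Default background method: tmux session gemma4-12b
Thinking off by default: --reasoning off, so that message.content in the OpenAI API response is not empty
Ollama path: use only when the user explicitly asks for Ollama, needs to integrate with the Ollama ecosystem, or asks about ollama pull gemma4:12b

If the user explicitly wants higher quality, suggest Q6_K or Q8_0 first; do not default to bf16 unless the user accepts higher memory use and slower loading. QAT (quantization-aware training) simulates quantization during training to reduce quality loss after compression. It is not lossless, so critical tasks still need verification in the current session.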

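Profile selection

Choose a profile based on the user's goal first. Do not treat 256K as the default, and do not switch to QAT automatically when the user only wants an everyday local service.

Profile	When to choose	Model / quant	Context	Port / alias
`daily-q4km-32k`	Default: everyday chat, coding, low-risk local API	`ggml-org/...:Q4_K_M`	`32768`	`8080` / `gemma-4-12b-it`
`long-q4km-128k`	User explicitly wants longer context but wants to stay on the default GGUF route	`ggml-org/...:Q4_K_M`	`65536` or `131072`	`8080` / `gemma-4-12b-it`
`qat-q4_0-256k`	User mentions QAT, Q4_0, 256K, the Google QAT blog, or low-memory long context	`google/...qat-q4_0-gguf:Q4_0`	`262144`	`8080` / `gemma-4-12b-it-qat-q4_0`
`compare-32k-vs-256k`	User wants a screen recording, a demo, or an A/B comparison of resource use and speed	left `Q4_K_M`, right `QAT Q4_0`	`32768` + `262144`	`8080` + `8081`

After choosing, state the profile, port, context size, and the reason for the choice in the final reply.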
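
The single-server rows of the profile table can be sketched as a small lookup. `profile_settings` is a hypothetical helper, not part of llama.cpp, and it picks 131072 for `long-q4km-128k` even though the table also allows 65536. The `compare-32k-vs-256k` profile is left out because it runs two servers, the second one on port 8081.

```shell
# profile_settings PROFILE: print "quant ctx port alias" for a single-server profile.
# Hypothetical helper for illustration; values are copied from the profile table.
profile_settings() {
  case "$1" in
    daily-q4km-32k) echo "Q4_K_M 32768 8080 gemma-4-12b-it" ;;
    long-q4km-128k) echo "Q4_K_M 131072 8080 gemma-4-12b-it" ;;
    qat-q4_0-256k)  echo "Q4_0 262144 8080 gemma-4-12b-it-qat-q4_0" ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
}
```

One way to consume it is `set -- $(profile_settings qat-q4_0-256k)`, after which `$2` holds the context size and `$4` the alias.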

Workflow

1. Inspect the current state

Check existing installs, processes, ports, and model caches first to avoid a duplicate deployment:

command -v llama-server || true
llama-server --version || true
tmux has-session -t gemma4-12b 2>/dev/null && tmux display-message -p -t gemma4-12b '#S #{pane_pid}' || true
lsof -nP -iTCP:8080 -sTCP:LISTEN || true
ls -lh "$HOME/Library/Caches/llama.cpp/"*gemma-4-12B-it*Q4_K_M*.gguf 2>/dev/null || true
find "$HOME/Library/Caches/llama.cpp" "$HOME/Models" \( -name '*gemma-4-12b-it-qat-q4_0*.gguf' -o -name '*gemma-4-12B-it-qat-q4_0*.gguf' \) 2>/dev/null || true

On Mac, also record hardware:

system_profiler SPHardwareDataType | sed -n '1,30p'
df -h "$HOME"

2. Install or upgrade llama.cpp

Use Homebrew on macOS:

brew install llama.cpp
# If already installed, upgrade only this package when possible.
brew upgrade llama.cpp
llama-server --version

Gemma 4 GGUF requires a llama.cpp build that recognizes general.architecture = gemma4. If loading fails with:

unknown model architecture: 'gemma4'

then upgrade llama.cpp and retry. A verified good local build was 9430; newer stable or HEAD is also acceptable.
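
A minimal sketch for comparing the installed build against the known-good one. It assumes that `llama-server --version` prints a line of the form `version: 9430 (abc1234)`; confirm that format on the installed build before relying on it. `build_at_least` is a hypothetical helper, and 9430 is used as a known-good reference, not as a documented minimum.

```shell
# build_at_least MIN: read `llama-server --version` text on stdin and succeed
# if the first "version: N" line reports N >= MIN.
# Assumption: the output contains a line such as "version: 9430 (abc1234)".
build_at_least() (
  min="$1"
  build=$(sed -n 's/^version: \([0-9][0-9]*\).*/\1/p' | head -n 1)
  [ -n "$build" ] && [ "$build" -ge "$min" ]
)

# Example (the 2>&1 covers builds that print the version on stderr):
#   llama-server --version 2>&1 | build_at_least 9430 || brew upgrade llama.cpp
```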

3. Download/load the model

For the default daily-q4km-32k profile, first-run download can be done by llama-server -hf:

llama-server \
  -hf ggml-org/gemma-4-12B-it-GGUF:Q4_K_M \
  --no-mmproj \
  --ctx-size 32768 \
  --gpu-layers 99 \
  --parallel 1 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it

After the model is cached, prefer starting with the local file path. Typical cache path:

$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf

For the qat-q4_0-256k profile, use the Google QAT GGUF repo:

llama-server \
  -hf google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 \
  --ctx-size 262144 \
  --gpu-layers 99 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it-qat-q4_0

If the download should be explicit or reusable outside the llama.cpp cache:

mkdir -p "$HOME/Models/gemma4-qat"
huggingface-cli download google/gemma-4-12B-it-qat-q4_0-gguf \
  gemma-4-12b-it-qat-q4_0.gguf \
  --local-dir "$HOME/Models/gemma4-qat"

4. Run persistently with tmux

If port 8080 is free and no gemma4-12b session exists:

tmux new-session -d -s gemma4-12b 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'

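The single quotes stop the calling shell from expanding $HOME; tmux passes the string to a shell, which normally expands it when the session starts. If the path still fails to resolve in the target shell, use the absolute path instead.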
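
The quoting behavior can be checked without tmux. This sketch uses `sh -c` as a stand-in for the shell that tmux starts, which is an assumption about how the session command is run; it shows that the single-quoted string stays literal until the inner shell evaluates it.

```shell
# The outer single quotes keep $HOME literal in the calling shell.
literal='echo "$HOME"'
printf 'string handed to the inner shell: %s\n' "$literal"

# The inner shell (sh -c here, standing in for tmux's shell) expands it.
expanded=$(sh -c "$literal")
printf 'after the inner shell runs it: %s\n' "$expanded"
```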

For qat-q4_0-256k with an explicit local file:

tmux new-session -d -s gemma4-qat-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it-qat-q4_0'

For compare-32k-vs-256k, keep separate session names and ports:

tmux new-session -d -s gemma4-left-32k 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'
tmux new-session -d -s gemma4-right-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8081 --alias gemma-4-12b-it-qat-q4_0'

Management commands:

tmux attach -t gemma4-12b
tmux kill-session -t gemma4-12b
tmux kill-session -t gemma4-qat-256k
tmux kill-session -t gemma4-left-32k
tmux kill-session -t gemma4-right-256k

5. Increase 12B context when requested

Do not tell the user that 12B is limited to 32K. 32768 is only the conservative default startup value; the 12B GGUF metadata reports a native training context (n_ctx_train) of 131072.

Use this selection table:

User need	`--ctx-size`	Notes
Fast daily chat / low memory	`32768`	Default.
Long coding sessions or medium documents	`65536`	Good balance on 16GB+ Macs if memory pressure is acceptable.
Max native 12B context	`131072`	Use when the user explicitly asks for larger or maximum context. Expect higher RSS and lower speed.
Beyond native context	Avoid by default	Requires RoPE/YaRN scaling and quality can degrade; explain risk before trying.

Restart with a larger context, keeping --parallel 1 and using Flash Attention plus quantized KV cache to reduce long-context pressure:

tmux kill-session -t gemma4-12b 2>/dev/null || true
tmux new-session -d -s gemma4-12b 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 131072 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'

If the model path is different, find it first:

find "$HOME/Library/Caches/llama.cpp" "$HOME/Models" -name '*gemma-4-12B-it*Q4_K_M*.gguf' 2>/dev/null

After restart, prove the actual context value instead of relying on the command line:

curl -fsS http://127.0.0.1:8080/v1/models | jq '.data[0].meta | {n_ctx, n_ctx_train, n_params, size}'

Expected long-context 12B result:

{
  "n_ctx": 131072,
  "n_ctx_train": 131072
}

If startup fails or memory pressure is high, retry --ctx-size 65536.
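
As a cross-check, the runtime context can also be read from the server's `/props` endpoint. This is a sketch under the assumption that the installed build reports `n_ctx` under `default_generation_settings`; with `--parallel 1` that per-slot value should match `--ctx-size`. `ctx_from_props` is a hypothetical helper.

```shell
# ctx_from_props: read a /props JSON document on stdin and print the runtime
# context size, or nothing if the field is absent.
# Assumption: the build exposes n_ctx under default_generation_settings.
ctx_from_props() {
  jq -r '.default_generation_settings.n_ctx // empty'
}

# Example (needs a running server):
#   curl -fsS http://127.0.0.1:8080/props | ctx_from_props
```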

6. Verify before claiming success

Run all three checks from the current session:

curl -fsS http://127.0.0.1:8080/health
curl -fsS http://127.0.0.1:8080/v1/models
curl -fsS http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma-4-12b-it","messages":[{"role":"user","content":"用一句中文回答：现在可以问你问题吗？"}],"max_tokens":80,"temperature":0.2}'

Success requires:

/health returns {"status":"ok"}
/v1/models lists the chosen alias, for example gemma-4-12b-it or gemma-4-12b-it-qat-q4_0
/v1/models shows the requested n_ctx when the user asked for larger context
chat response has non-empty choices[0].message.content
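
The success criteria above can be turned into checks that fail loudly instead of being read by eye. This is a sketch using `jq`; the three function names are hypothetical, and each reads a response body on stdin.

```shell
# health_ok: succeed if a /health body reports status "ok".
health_ok() {
  jq -e '.status == "ok"' >/dev/null
}

# alias_listed ALIAS: succeed if a /v1/models body lists ALIAS as a model id.
alias_listed() {
  jq -e --arg id "$1" 'any(.data[]?; .id == $id)' >/dev/null
}

# content_nonempty: succeed if a chat completion body has a non-empty
# choices[0].message.content (a missing or null field counts as empty).
content_nonempty() {
  jq -e '(.choices[0].message.content // "") | length > 0' >/dev/null
}

# Example (needs a running server):
#   curl -fsS http://127.0.0.1:8080/health | health_ok && echo "health ok"
#   curl -fsS http://127.0.0.1:8080/v1/models | alias_listed gemma-4-12b-it
```

Piping the chat completion response through `content_nonempty` catches the case where only reasoning output is returned.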

7. Report resource usage

Use both process and macOS footprint views:

pid=$(pgrep -f 'llama-server .*gemma-4-12B-it-Q4_K_M.gguf' | head -1)
ps -p "$pid" -o pid,stat,%cpu,%mem,rss,vsz,etime,command
footprint -p "$pid" -summary 2>/dev/null | sed -n '1,80p'
memory_pressure | sed -n '1,20p'
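
The `pgrep` pattern above only matches the Q4_K_M file name. For the QAT or comparison profiles, a sketch that finds the process by its listening port instead may be easier; `pid_on_port` is a hypothetical helper built on `lsof -t`, which prints PIDs only.

```shell
# pid_on_port PORT: print the PID of the process listening on a TCP port, if any.
pid_on_port() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN -t 2>/dev/null | head -n 1
}

pid=$(pid_on_port 8080)
if [ -n "$pid" ]; then
  ps -p "$pid" -o pid,rss,etime,command
else
  echo "no listener on port 8080" >&2
fi
```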

Explain the difference clearly:

GGUF Q4_K_M is a quantized model; its file is about 7GB, versus about 24GB for 12B parameters at bf16 (2 bytes per parameter).
ps RSS includes mapped model pages and often shows around 9-11GB for Q4 12B.
footprint may show lower physical pressure because clean mmap pages can be discarded and reread.
Apple Silicon uses unified memory; GPU work does not appear as a separate NVIDIA-style VRAM number.
Larger --ctx-size increases KV/cache memory and may reduce tokens/sec even when the prompt is short.
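
To make the last point concrete, here is a rough upper-bound estimate of KV cache size as a function of context length. The model dimensions in the example are placeholders chosen for illustration, not Gemma 4 12B's real values (read those from the GGUF metadata), and the formula assumes every layer uses full attention, so models with sliding-window layers need less.

```shell
# kv_cache_bytes LAYERS KV_HEADS HEAD_DIM N_CTX TYPE
# Estimate K+V cache size for full (non-sliding-window) attention.
# TYPE is f16 (2 bytes per element) or q8_0 (34 bytes per 32-element block).
kv_cache_bytes() (
  layers=$1 kv_heads=$2 head_dim=$3 n_ctx=$4 type=$5
  elems=$((2 * layers * kv_heads * head_dim * n_ctx))  # 2 = one K and one V tensor
  case "$type" in
    f16)  echo $((elems * 2)) ;;
    q8_0) echo $((elems * 34 / 32)) ;;
    *) echo "unsupported cache type: $type" >&2; exit 1 ;;
  esac
)

# Placeholder dimensions (48 layers, 8 KV heads, head_dim 128), NOT Gemma 4's:
#   kv_cache_bytes 48 8 128 32768 f16    # 6442450944 bytes (6 GiB)
#   kv_cache_bytes 48 8 128 32768 q8_0   # 3422552064 bytes
```

The estimate scales linearly with `n_ctx`, which is why going from 32768 to 131072 roughly quadruples it, and why `--cache-type-k q8_0 --cache-type-v q8_0` nearly halves it relative to f16.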

8. Optional: Ollama route

Use this path only when the user asks for Ollama. Treat official Ollama registry state as live-changing: re-test before claiming support.

Install and start Ollama:

brew install ollama
ollama --version
lsof -nP -iTCP:11434 -sTCP:LISTEN || true
tmux new-session -d -s ollama-gemma4 'OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve'
curl -fsS http://127.0.0.1:11434/api/version

First try the official path:

ollama pull gemma4:12b

If it succeeds, run:

ollama run gemma4:12b "用一句中文回答：现在可以问你问题吗？"

If it fails with pull model manifest: file does not exist, fall back to GGUF import. Download or reuse a local GGUF:

mkdir -p "$HOME/Models/gemma4-12b"
huggingface-cli download ggml-org/gemma-4-12B-it-GGUF \
  gemma-4-12B-it-Q4_K_M.gguf \
  --local-dir "$HOME/Models/gemma4-12b"

Create a Modelfile:

cat > "$HOME/Models/gemma4-12b/Modelfile" <<EOF
FROM $HOME/Models/gemma4-12b/gemma-4-12B-it-Q4_K_M.gguf
EOF
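
The unquoted `EOF` delimiter matters here: it lets the shell expand `$HOME` while writing the file, so the Modelfile stores a resolved path. A stricter variant can be sketched as a function that refuses anything other than an absolute path; `write_modelfile` is a hypothetical helper, and the absolute-path requirement is a choice made for this sketch, not an Ollama rule.

```shell
# write_modelfile DIR GGUF: write DIR/Modelfile with a FROM line pointing at GGUF.
# Rejects non-absolute paths so that a literal, unexpanded "$HOME" is never stored.
write_modelfile() {
  case "$2" in
    /*) printf 'FROM %s\n' "$2" > "$1/Modelfile" ;;
    *)  echo "expected an absolute path, got: $2" >&2; return 1 ;;
  esac
}

# Example:
#   write_modelfile "$HOME/Models/gemma4-12b" "$HOME/Models/gemma4-12b/gemma-4-12B-it-Q4_K_M.gguf"
```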

Homebrew ollama builds may lack sidecar llama-server / llama-quantize binaries. If ollama create or ollama run reports either binary missing, create a stable working directory with symlinks to llama.cpp:

mkdir -p "$HOME/ollama-gemma4/build/lib/ollama"
ln -sf /opt/homebrew/bin/llama-server "$HOME/ollama-gemma4/build/lib/ollama/llama-server"
ln -sf /opt/homebrew/bin/llama-quantize "$HOME/ollama-gemma4/build/lib/ollama/llama-quantize"
tmux kill-session -t ollama-gemma4 2>/dev/null || true
tmux new-session -d -s ollama-gemma4 "cd '$HOME/ollama-gemma4' && OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve"

Import and run:

ollama create gemma4-12b-gguf-local -f "$HOME/Models/gemma4-12b/Modelfile"
ollama list
ollama run gemma4-12b-gguf-local "用一句中文回答：Ollama 能跑 Gemma 4 12B 吗？"

For Ollama success, report whether it was:

official registry pull: ollama pull gemma4:12b
manual GGUF import: ollama create gemma4-12b-gguf-local
workaround needed: sidecar symlinks to llama.cpp

Troubleshooting

Symptom	Fix
`unknown model architecture: 'gemma4'`	Upgrade `llama.cpp`; old builds do not support Gemma 4 GGUF.
Port 8080 busy	Show the listener with `lsof`; either stop it or choose another port.
Chat `content` is empty and only reasoning appears	Restart with `--reasoning off`.
First-run `-hf` hangs or repeats metadata resolution	Use the cached local GGUF path with `-m`.
`ollama pull gemma4:12b` returns `pull model manifest: file does not exist`	Official registry tag is not ready or is temporarily inconsistent; use manual GGUF import.
Ollama reports `llama-server binary not found` or `llama-quantize binary not found`	Symlink those binaries from `llama.cpp` into the Ollama working directory and start `ollama serve` from there.
User wants image/multimodal	Remove `--no-mmproj` only after testing `mmproj`; text-only deployment is the stable default.
Memory too high	Lower context, use `Q4_K_M`, or reduce `--parallel` to `1`.

Final response shape

Answer in Chinese unless the user asks otherwise. Include:

endpoint URL
model id
tmux/session management commands
verification results from this session
resource summary and any caveats

Cross-Check Agent

Use agents/openai.yaml only when the deployment plan or troubleshooting result needs an independent model review before execution.