gemma4-local-deploy



By majiayu000. Updated 6/6/2026.

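name: gemma4-local-deploy
description: Deploy Gemma 4 12B locally on a Mac or Apple Silicon machine. Install or upgrade llama.cpp, download a quantized GGUF model, and expose an OpenAI-compatible API with llama-server, or serve the model locally through Ollama. Depending on what the user needs, choose between the default Q4_K_M, 64K/128K long context, QAT Q4_0 @ 256K, and a side-by-side comparison demo; set up tmux for background running; verify the health check, chat endpoint, resource usage, and common failures. Use when the user mentions deploying Gemma 4, Gemma 4 12B, local LLMs, long context, QAT, quantization, llama-server, Ollama, GGUF, or a local model service on Mac.
allowed-tools: Bash, Read, WebSearch, WebFetch
metadata:
  argument-hint: "[model quantization / port / run in background or not]"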

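Gemma 4 12B Local Deployment

Goal: deploy the GGUF build of Gemma 4 12B as a local model service. By default, use llama.cpp / llama-server with Apple Metal, Q4_K_M, and tmux to expose an OpenAI-compatible API. Switch to the QAT Q4_0 profile when the user explicitly asks for QAT, 256K, or a comparison demo. Take the Ollama import path only when the user explicitly asks for Ollama.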

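Defaults

  • Default model repo: ggml-org/gemma-4-12B-it-GGUF
  • Default quantization: Q4_K_M
  • Default model name: gemma-4-12b-it
  • Default endpoint: 127.0.0.1:8080
  • Default context: 32768
  • 12B long context: when the user explicitly asks for more context, switch to 65536 or the native maximum of 131072
  • QAT repo: google/gemma-4-12B-it-qat-q4_0-gguf
  • QAT quantization: Q4_0; the file name is usually gemma-4-12b-it-qat-q4_0.gguf
  • QAT context: use 262144 when the user asks for QAT, maximum context, or 256K
  • Default background method: tmux session gemma4-12b
  • Thinking off by default: --reasoning off, so that message.content in the OpenAI-compatible API is not empty
  • Ollama path: use only when the user explicitly asks for Ollama, needs to plug into the Ollama ecosystem, or asks about ollama pull gemma4:12b

If the user explicitly wants higher quality, suggest Q6_K or Q8_0 first; do not default to bf16 unless the user accepts higher memory use and slower loading. QAT simulates quantization during training to reduce the quality loss after compression; it is not lossless. For critical tasks, still run the verification in the current session.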

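Profile selection

First pick a profile based on the user's goal. Do not treat 256K as the default, and do not switch to QAT automatically when the user only wants an everyday local service.

| Profile | When to choose | Model / quant | Context | Port / alias |
|---|---|---|---|---|
| daily-q4km-32k | Default: everyday chat, coding, low-risk local API | ggml-org/...:Q4_K_M | 32768 | 8080 / gemma-4-12b-it |
| long-q4km-128k | User explicitly wants longer context but wants to stay on the default GGUF route | ggml-org/...:Q4_K_M | 65536 or 131072 | 8080 / gemma-4-12b-it |
| qat-q4_0-256k | User mentions QAT, Q4_0, 256K, the Google QAT blog, or low-memory long context | google/...qat-q4_0-gguf:Q4_0 | 262144 | 8080 / gemma-4-12b-it-qat-q4_0 |
| compare-32k-vs-256k | User wants a screen recording, a demo, or an A/B comparison of resources and speed | left Q4_K_M, right QAT Q4_0 | 32768 + 262144 | 8080 + 8081 |

After choosing, state the profile, port, context, and the reason for the choice in the final reply.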
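The single-server rows of the profile table can be expressed as a small lookup. This is a minimal sketch: the function name `profile_settings` and its "context port alias" output format are hypothetical and not part of the skill.

```shell
# Hypothetical helper: print "ctx port alias" for a single-server profile.
# Values are copied from the profile table; the compare profile is left out
# because it starts two servers.
profile_settings() {
  case "$1" in
    daily-q4km-32k) echo "32768 8080 gemma-4-12b-it" ;;
    # long-q4km-128k accepts 65536 or 131072; an optional second argument picks one.
    long-q4km-128k) echo "${2:-131072} 8080 gemma-4-12b-it" ;;
    qat-q4_0-256k)  echo "262144 8080 gemma-4-12b-it-qat-q4_0" ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
}
```

For example, `set -- $(profile_settings qat-q4_0-256k)` leaves the context in `$1`, the port in `$2`, and the alias in `$3`.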

Workflow

1. Check the current state

First check for existing installs, processes, ports, and cached models to avoid deploying twice:

command -v llama-server || true
llama-server --version || true
tmux has-session -t gemma4-12b 2>/dev/null && tmux display-message -p -t gemma4-12b '#S #{pane_pid}' || true
lsof -nP -iTCP:8080 -sTCP:LISTEN || true
ls -lh "$HOME/Library/Caches/llama.cpp/"*gemma-4-12B-it*Q4_K_M*.gguf 2>/dev/null || true
find "$HOME/Library/Caches/llama.cpp" "$HOME/Models" \( -name '*gemma-4-12b-it-qat-q4_0*.gguf' -o -name '*gemma-4-12B-it-qat-q4_0*.gguf' \) 2>/dev/null || true

On Mac, also record hardware:

system_profiler SPHardwareDataType | sed -n '1,30p'
df -h "$HOME"
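When recording hardware, total memory is the number that matters most for picking a context size. The sketch below converts a byte count to GiB. Both helper names are hypothetical, and the 16 GiB threshold is only the "16GB+ Macs" hint from the context table in this document, not a measured limit.

```shell
# Convert a byte count to whole GiB with integer arithmetic.
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# Hypothetical predicate: succeed when the machine has at least 16 GiB,
# the size this document mentions as workable for a 65536 context.
can_consider_64k() {
  [ "$(bytes_to_gib "$1")" -ge 16 ]
}
```

On macOS the byte count can come from `sysctl -n hw.memsize`, for example `bytes_to_gib "$(sysctl -n hw.memsize)"`.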

2. Install or upgrade llama.cpp

Use Homebrew on macOS:

brew install llama.cpp
# If already installed, upgrade only this package when possible.
brew upgrade llama.cpp
llama-server --version

Gemma 4 GGUF requires a llama.cpp build that recognizes general.architecture = gemma4. If loading fails with:

unknown model architecture: 'gemma4'

then upgrade llama.cpp and retry. A verified good local build was 9430; newer stable or HEAD is also acceptable.
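The build check can be scripted. This sketch assumes the version text contains a line shaped like `version: 9430 (abc1234)`; that format is an assumption about llama.cpp's output, and treating 9430 as a floor is also an assumption, since the text only says that build was verified.

```shell
# Extract the build number from text shaped like "version: 9430 (abc1234)".
# The output format is an assumption; adjust the pattern if your build differs.
llama_build_number() {
  printf '%s\n' "$1" | sed -n 's/^version: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when the build number is at least the minimum (default 9430).
# The subshell body keeps the helper variable from leaking into the caller.
build_at_least() (
  build=$(llama_build_number "$1")
  [ -n "$build" ] && [ "$build" -ge "${2:-9430}" ]
)
```

A possible use is `build_at_least "$(llama-server --version 2>&1)" || brew upgrade llama.cpp`; the `2>&1` is there in case the version line is written to stderr.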

3. Download/load the model

For the default daily-q4km-32k profile, first-run download can be done by llama-server -hf:

llama-server \
  -hf ggml-org/gemma-4-12B-it-GGUF:Q4_K_M \
  --no-mmproj \
  --ctx-size 32768 \
  --gpu-layers 99 \
  --parallel 1 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it

After the model is cached, prefer starting with the local file path. Typical cache path:

$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf
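The cache file name appears to be the repo id with `/` replaced by `_`, joined to the GGUF file name with another `_`. That rule is inferred from the single example path above, so treat this helper (a hypothetical name) as a guess to confirm with `ls`, not a documented contract.

```shell
# Build the expected cache path for a repo id and GGUF file name.
# Naming rule inferred from one example path; verify it on your machine.
llama_cache_path() {
  printf '%s/Library/Caches/llama.cpp/%s_%s\n' \
    "$HOME" "$(printf '%s' "$1" | tr '/' '_')" "$2"
}
```

Calling `llama_cache_path ggml-org/gemma-4-12B-it-GGUF gemma-4-12B-it-Q4_K_M.gguf` reproduces the example path, which can then be passed to `-m` once `[ -f ... ]` confirms the file exists.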

For the qat-q4_0-256k profile, use the Google QAT GGUF repo:

llama-server \
  -hf google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 \
  --ctx-size 262144 \
  --gpu-layers 99 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it-qat-q4_0

If the download should be explicit or reusable outside the llama.cpp cache:

mkdir -p "$HOME/Models/gemma4-qat"
huggingface-cli download google/gemma-4-12B-it-qat-q4_0-gguf \
  gemma-4-12b-it-qat-q4_0.gguf \
  --local-dir "$HOME/Models/gemma4-qat"
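After an explicit download it can help to confirm the file really is a GGUF model and not an error page or a truncated stub. GGUF files begin with the four ASCII bytes `GGUF`. The helper name below is hypothetical, and the check only looks at the header, so it says nothing about whether the download is complete.

```shell
# Succeed when the path exists and starts with the GGUF magic bytes.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" || echo "not a GGUF file"` catches an obviously bad download before llama-server tries to load it.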

4. Run persistently with tmux

If port 8080 is free and no gemma4-12b session exists:

tmux new-session -d -s gemma4-12b 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'

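The single quotes stop the calling shell from expanding $HOME; the shell that tmux starts expands it when the command runs. If $HOME does not resolve correctly in that shell, use the absolute path instead.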
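The quoting behaviour can be checked without tmux. In this sketch `sh -c` stands in for the shell that tmux starts, which is an approximation; the variable names are illustrative.

```shell
# Single quotes keep the calling shell from expanding $HOME; the child shell
# (sh -c here, standing in for the shell tmux starts) expands it instead.
deferred=$(sh -c 'printf "%s" "$HOME"')

# Double quotes expand $HOME immediately in the calling shell.
immediate="$HOME"

# Both routes should produce the same path when the child inherits HOME.
if [ "$deferred" = "$immediate" ]; then
  echo "HOME resolves the same way in the child shell"
else
  echo "HOME differs in the child shell; use an absolute path" >&2
fi
```

If the two values differ, for instance because the tmux server was started with a different environment, that is the case where the absolute path is the safer choice.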

For qat-q4_0-256k with an explicit local file:

tmux new-session -d -s gemma4-qat-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it-qat-q4_0'

For compare-32k-vs-256k, keep separate session names and ports:

tmux new-session -d -s gemma4-left-32k 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'
tmux new-session -d -s gemma4-right-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8081 --alias gemma-4-12b-it-qat-q4_0'

Management commands:

tmux attach -t gemma4-12b
tmux kill-session -t gemma4-12b
tmux kill-session -t gemma4-qat-256k
tmux kill-session -t gemma4-left-32k
tmux kill-session -t gemma4-right-256k

5. Increase 12B context when requested

Do not tell the user that 12B is limited to 32K. 32768 is only the conservative default startup value; the 12B GGUF metadata reports a native training context of 131072.

Use this selection table:

| User need | --ctx-size | Notes |
|---|---|---|
| Fast daily chat / low memory | 32768 | Default. |
| Long coding sessions or medium documents | 65536 | Good balance on 16GB+ Macs if memory pressure is acceptable. |
| Max native 12B context | 131072 | Use when the user explicitly asks for larger or maximum context. Expect higher RSS and lower speed. |
| Beyond native context | Avoid by default | Requires RoPE/YaRN scaling and quality can degrade; explain risk before trying. |

Restart with a larger context, keeping --parallel 1 and using Flash Attention plus quantized KV cache to reduce long-context pressure:

tmux kill-session -t gemma4-12b 2>/dev/null || true
tmux new-session -d -s gemma4-12b 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 131072 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'

If the model path is different, find it first:

find "$HOME/Library/Caches/llama.cpp" "$HOME/Models" -name '*gemma-4-12B-it*Q4_K_M*.gguf' 2>/dev/null

After restart, prove the actual context value instead of relying on the command line:

curl -fsS http://127.0.0.1:8080/v1/models | jq '.data[0].meta | {n_ctx, n_ctx_train, n_params, size}'

Expected values for the long-context 12B result (other fields omitted):

{
  "n_ctx": 131072,
  "n_ctx_train": 131072
}

If startup fails or memory pressure is high, retry --ctx-size 65536.
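Comparing the reported `n_ctx` with the requested value can be automated. The sketch below is a fallback for hosts without jq: it pulls the first `"n_ctx"` number out of the JSON text with grep. Plain-text matching is fragile compared with jq, and both function names are hypothetical.

```shell
# Print the first "n_ctx" value found in a /v1/models JSON body.
# The closing quote in the pattern keeps "n_ctx_train" from matching.
extract_n_ctx() {
  printf '%s\n' "$1" \
    | grep -o '"n_ctx":[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}

# Succeed when the reported context equals the requested one.
ctx_matches() {
  [ "$(extract_n_ctx "$1")" = "$2" ]
}
```

A possible use is `ctx_matches "$(curl -fsS http://127.0.0.1:8080/v1/models)" 131072 || echo "server is not running with the requested context"`.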

6. Verify before claiming success

Run all three checks from the current session:

curl -fsS http://127.0.0.1:8080/health
curl -fsS http://127.0.0.1:8080/v1/models
curl -fsS http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma-4-12b-it","messages":[{"role":"user","content":"用一句中文回答:现在可以问你问题吗?"}],"max_tokens":80,"temperature":0.2}'

Success requires:

  • /health returns {"status":"ok"}
  • /v1/models lists the chosen alias, for example gemma-4-12b-it or gemma-4-12b-it-qat-q4_0
  • /v1/models shows the requested n_ctx when the user asked for larger context
  • chat response has non-empty choices[0].message.content
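The checks above can be collected into a small pass/fail report so that success is only claimed when nothing failed. This is a sketch: `health_ok`, `check`, and `checks_failed` are hypothetical names, and only the health criterion is implemented here.

```shell
# Succeed when a /health body reports status "ok" (whitespace-insensitive).
health_ok() {
  case "$(printf '%s' "$1" | tr -d '[:space:]')" in
    *'"status":"ok"'*) true ;;
    *) false ;;
  esac
}

# Run one check, print PASS or FAIL with its label, and count failures.
# Usage: check <label> <command> [args...]
checks_failed=0
check() {
  label=$1
  shift
  if "$@"; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    checks_failed=$((checks_failed + 1))
  fi
}
```

For example, `check health health_ok "$(curl -fsS http://127.0.0.1:8080/health)"` followed by `[ "$checks_failed" -eq 0 ]` gives a single condition to gate the final report on.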

7. Report resource usage

Use both process and macOS footprint views:

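# Matches the default Q4_K_M file; adjust the pattern for the QAT file name.
pid=$(pgrep -f 'llama-server .*gemma-4-12B-it-Q4_K_M\.gguf' | head -1)
[ -n "$pid" ] || echo "no matching llama-server process found" >&2
[ -n "$pid" ] && ps -p "$pid" -o pid,stat,%cpu,%mem,rss,vsz,etime,command
[ -n "$pid" ] && footprint -p "$pid" -summary 2>/dev/null | sed -n '1,80p'
memory_pressure | sed -n '1,20p'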

Explain the difference clearly:

  • GGUF Q4_K_M is a quantized model; its file is about 7GB, not 24GB full precision.
  • ps RSS includes mapped model pages and often shows around 9-11GB for Q4 12B.
  • footprint may show lower physical pressure because clean mmap pages can be discarded and reread.
  • Apple Silicon uses unified memory; GPU work does not appear as a separate NVIDIA-style VRAM number.
  • Larger --ctx-size increases KV/cache memory and may reduce tokens/sec even when the prompt is short.
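The effect of context size on KV cache memory can be estimated with the standard full-attention formula: elements = 2 (K and V) × layers × KV heads × head dimension × context. The sketch below applies it with integer arithmetic. The layer, head, and dimension values in the usage example are hypothetical placeholders, not Gemma 4's real architecture; read the real ones from the GGUF metadata. Models that use reduced attention on some layers need less, so treat the result as an upper bound.

```shell
# Estimate KV cache size in bytes for full attention.
# Args: layers kv_heads head_dim ctx bytes_per_32_elements
# bytes_per_32_elements is 64 for f16 (2 bytes each) and 34 for q8_0
# (32 int8 values plus one 2-byte scale per block).
kv_cache_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 / 32 * $5 ))
}
```

With the placeholder values 48 layers, 8 KV heads, and head dimension 128 at a 32768 context, `kv_cache_bytes 48 8 128 32768 64` gives 6442450944 bytes (6 GiB) for f16, and `kv_cache_bytes 48 8 128 32768 34` gives 3422552064 bytes for q8_0. The estimate scales linearly with `--ctx-size`, which is why quantized KV cache matters more at long context.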

8. Optional: Ollama route

Use this path only when the user asks for Ollama. Treat official Ollama registry state as live-changing: re-test before claiming support.

Install and start Ollama:

brew install ollama
ollama --version
lsof -nP -iTCP:11434 -sTCP:LISTEN || true
tmux new-session -d -s ollama-gemma4 'OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve'
curl -fsS http://127.0.0.1:11434/api/version

First try the official path:

ollama pull gemma4:12b

If it succeeds, run:

ollama run gemma4:12b "用一句中文回答:现在可以问你问题吗?"

If it fails with pull model manifest: file does not exist, fall back to GGUF import. Download or reuse a local GGUF:

mkdir -p "$HOME/Models/gemma4-12b"
huggingface-cli download ggml-org/gemma-4-12B-it-GGUF \
  gemma-4-12B-it-Q4_K_M.gguf \
  --local-dir "$HOME/Models/gemma4-12b"
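The decision to fall back can be made from the pull output itself. This sketch matches the error text quoted above; the helper name is hypothetical, and other failure messages (network errors, for example) are deliberately not treated as a reason to import manually.

```shell
# Succeed when the pull output contains the missing-manifest error,
# which is the case this document treats as a signal for manual GGUF import.
needs_gguf_import() {
  case "$1" in
    *'pull model manifest: file does not exist'*) true ;;
    *) false ;;
  esac
}
```

One way to use it is to capture the output with `err=$(ollama pull gemma4:12b 2>&1)` and, when the pull fails, call `needs_gguf_import "$err"` to decide whether to continue with the import steps or report the error as-is.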

Create a Modelfile:

cat > "$HOME/Models/gemma4-12b/Modelfile" <<EOF
FROM $HOME/Models/gemma4-12b/gemma-4-12B-it-Q4_K_M.gguf
EOF
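The heredoc above is unquoted, so the shell expands `$HOME` and the Modelfile ends up with an absolute path. A parameterized variant that also refuses to write a Modelfile for a missing GGUF might look like the sketch below; the function name and the `mf_` variable prefix are hypothetical.

```shell
# Write <dir>/Modelfile pointing at an existing GGUF file.
# Args: target directory, path to the GGUF file (absolute paths are safest).
write_modelfile() {
  mf_dir=$1
  mf_gguf=$2
  if [ ! -f "$mf_gguf" ]; then
    echo "GGUF not found: $mf_gguf" >&2
    return 1
  fi
  mkdir -p "$mf_dir"
  # Unquoted heredoc delimiter: $mf_gguf is expanded before the file is written.
  cat > "$mf_dir/Modelfile" <<EOF
FROM $mf_gguf
EOF
}
```

For example, `write_modelfile "$HOME/Models/gemma4-12b" "$HOME/Models/gemma4-12b/gemma-4-12B-it-Q4_K_M.gguf"` produces the same Modelfile as the manual step, but fails early when the download path is wrong.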

Homebrew ollama builds may lack sidecar llama-server / llama-quantize binaries. If ollama create or ollama run reports either binary missing, create a stable working directory with symlinks to llama.cpp:

mkdir -p "$HOME/ollama-gemma4/build/lib/ollama"
ln -sf /opt/homebrew/bin/llama-server "$HOME/ollama-gemma4/build/lib/ollama/llama-server"
ln -sf /opt/homebrew/bin/llama-quantize "$HOME/ollama-gemma4/build/lib/ollama/llama-quantize"
tmux kill-session -t ollama-gemma4 2>/dev/null || true
tmux new-session -d -s ollama-gemma4 "cd '$HOME/ollama-gemma4' && OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve"

Import and run:

ollama create gemma4-12b-gguf-local -f "$HOME/Models/gemma4-12b/Modelfile"
ollama list
ollama run gemma4-12b-gguf-local "用一句中文回答:Ollama 能跑 Gemma 4 12B 吗?"

For Ollama success, report whether it was:

  • official registry pull: ollama pull gemma4:12b
  • manual GGUF import: ollama create gemma4-12b-gguf-local
  • workaround needed: sidecar symlinks to llama.cpp

Troubleshooting

| Symptom | Fix |
|---|---|
| unknown model architecture: 'gemma4' | Upgrade llama.cpp; old builds do not support Gemma 4 GGUF. |
| Port 8080 busy | Show the listener with lsof; either stop it or choose another port. |
| Chat content is empty and only reasoning appears | Restart with --reasoning off. |
| First-run -hf hangs or repeats metadata resolution | Use the cached local GGUF path with -m. |
| ollama pull gemma4:12b returns pull model manifest: file does not exist | Official registry tag is not ready or is temporarily inconsistent; use manual GGUF import. |
| Ollama reports llama-server binary not found or llama-quantize binary not found | Symlink those binaries from llama.cpp into the Ollama working directory and start ollama serve from there. |
| User wants image/multimodal | Remove --no-mmproj only after testing mmproj; text-only deployment is the stable default. |
| Memory too high | Lower context, use Q4_K_M, or reduce --parallel to 1. |

Final response shape

Answer in Chinese unless the user asks otherwise. Include:

  • endpoint URL
  • model id
  • tmux/session management commands
  • verification results from this session
  • resource summary and any caveats

Cross-Check Agent

Use agents/openai.yaml only when the deployment plan or troubleshooting result needs an independent model review before execution.

Install via CLI
npx skills add https://github.com/majiayu000/spellbook --skill gemma4-local-deploy