name: deploy-edge-ai-model
locale: caveman-ultra
source_locale: en
source_commit: 82c77053
translator: "Julius Brussee homage — caveman"
translation_date: "2026-04-19"
description: >
  Deploy machine learning models to edge devices using Google AI Edge Gallery,
  TensorFlow Lite, ONNX Runtime, and MediaPipe. Covers model quantization
  (INT8/INT4), on-device inference with Gemma 4 models, Android/iOS deployment
  via AI Edge Gallery, hardware delegate selection (GPU/NPU/DSP), and
  performance benchmarking on constrained devices. Use when deploying models to
  mobile phones, IoT devices, or embedded systems where cloud inference is
  impractical due to latency, cost, or connectivity constraints.
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob WebFetch
metadata:
  author: Philipp Thoss
  version: "1.0"
  domain: edge-computing
  complexity: advanced
  language: multi
  tags: edge-ai, google-ai-edge, gemma, tflite, onnx, quantization, on-device
Deploy Edge AI Model
See Extended Examples for complete configuration files, quantization scripts, and benchmark templates.
ML → edge devices. Optimized inference, HW accel, on-device mgmt.
Use When
- LLMs (Gemma 4, Phi, Llama) → mobile via Google AI Edge Gallery
- Convert → TFLite/ONNX for on-device
- Quantize → INT8/INT4, less mem + faster
- Android/iOS apps w/ local AI
- HW delegate select (GPU, NPU, DSP, Hexagon, CoreML)
- Bench latency + mem on target
- MediaPipe tasks → mobile/embedded
In
- Required: Trained model (SavedModel, PyTorch, ONNX, HF checkpoint)
- Required: Target platform (Android, iOS, Linux embedded, browser)
- Required: Device constraints (RAM, storage, compute)
- Optional: Calibration dataset → post-training quant
- Optional: AI Edge Gallery config → LLM deploy
- Optional: HW delegate prefs
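Device constraints → rough weight footprint. Sketch only, not from source: size ≈ params × bits / 8. The 2-billion-parameter count is a hypothetical example, and the estimate ignores activations, KV cache, and runtime overhead.

```python
def weight_footprint_mb(num_params, bits_per_weight):
    """Approximate weight storage in MiB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / (1024 * 1024)


# Hypothetical 2-billion-parameter model at several precisions
params = 2_000_000_000
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_footprint_mb(params, bits):,.0f} MB")
```

Halving the bit width halves the weight footprint, which is why INT4 is the usual starting point for LLMs on phones.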
Do
Step 1: Eval model → edge
Size, latency, device cap.
# assess_model.py
import os
import tensorflow as tf

def assess_model_for_edge(saved_model_path, target_ram_mb=4096):
    """Evaluate whether a model is suitable for edge deployment."""
    model = tf.saved_model.load(saved_model_path)

    # Check model size on disk
    model_size_mb = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)

    print(f"Model size: {model_size_mb:.1f} MB")
    print(f"Target RAM: {target_ram_mb} MB")
    print(f"Size/RAM ratio: {model_size_mb / target_ram_mb:.2%}")

    if model_size_mb > target_ram_mb * 0.25:
        print("WARNING: Model exceeds 25% of device RAM - quantization recommended")
        return False
    return True
Decision matrix:
| Model Size | Device RAM | Recommended Action |
|---|---|---|
| < 50 MB | 2+ GB | Direct TFLite conversion |
| 50-500 MB | 4+ GB | INT8 quantization + TFLite |
| 500 MB-2 GB | 6+ GB | INT4 quantization + AI Edge Gallery |
| 2-4 GB | 8+ GB | Gemma 4 via AI Edge Gallery with INT4 |
| > 4 GB | 12+ GB | Weight streaming or cloud-edge hybrid |
→ Assessment done, size/RAM ratios, quant recommendation by constraints.
If err: SavedModel path valid (ls saved_model/), TF installed (python -c "import tensorflow"), disk space OK, format supported.
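Decision matrix as lookup. Sketch only: tiers copied from the table above; boundary handling (upper bound exclusive, 2 GB = 2048 MB) and the names `TIERS` and `recommend_action` are assumptions.

```python
# (upper size bound in MB, minimum RAM in GB, recommended action)
TIERS = [
    (50, 2, "Direct TFLite conversion"),
    (500, 4, "INT8 quantization + TFLite"),
    (2048, 6, "INT4 quantization + AI Edge Gallery"),
    (4096, 8, "Gemma 4 via AI Edge Gallery with INT4"),
    (float("inf"), 12, "Weight streaming or cloud-edge hybrid"),
]


def recommend_action(model_size_mb, device_ram_gb):
    """Return (action, ram_ok) for the first tier whose size bound fits."""
    for upper_mb, min_ram_gb, action in TIERS:
        if model_size_mb < upper_mb:
            return action, device_ram_gb >= min_ram_gb
    raise ValueError("model_size_mb must be a number")


print(recommend_action(300, 2))
```

`ram_ok` false → device has less RAM than the row asks for; pick a smaller model or a stronger quantization level.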
Step 2: LLMs via Google AI Edge Gallery
Gemma 4 + LLMs → Android.
# Clone AI Edge Gallery
git clone https://github.com/google-ai-edge/gallery.git
cd gallery/Android/src
# Build the Android app
./gradlew assembleDebug
# Install on connected device
adb install -r app/build/outputs/apk/debug/app-debug.apk
Gemma 4 config:
{
  "models": [
    {
      "name": "Gemma 4 2B IT",
      "url": "https://huggingface.co/google/gemma-4-2b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_k": 40,
        "top_p": 0.95
      }
    },
    {
      "name": "Gemma 4 4B IT",
      "url": "https://huggingface.co/google/gemma-4-4b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 2048,
        "temperature": 0.7
      }
    }
  ]
}
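Config sanity check before pushing to device. Sketch only: key names come from the config above; the range checks and the name `check_gallery_config` are assumptions, not a schema published by AI Edge Gallery.

```python
import json


def check_gallery_config(text):
    """Return a list of problems found in a model config shaped like the one above."""
    problems = []
    for model in json.loads(text).get("models", []):
        name = model.get("name", "<unnamed>")
        for key in ("name", "url", "format", "backend"):
            if key not in model:
                problems.append(f"{name}: missing '{key}'")
        cfg = model.get("config", {})
        # Assumed sensible ranges for sampling parameters
        if "max_tokens" in cfg and cfg["max_tokens"] <= 0:
            problems.append(f"{name}: max_tokens must be > 0")
        if "temperature" in cfg and cfg["temperature"] < 0:
            problems.append(f"{name}: temperature must be >= 0")
        if "top_k" in cfg and cfg["top_k"] < 1:
            problems.append(f"{name}: top_k must be >= 1")
        if "top_p" in cfg and not 0 < cfg["top_p"] <= 1:
            problems.append(f"{name}: top_p must be in (0, 1]")
    return problems


sample = '{"models": [{"name": "demo", "url": "x", "format": "tflite", "config": {"top_p": 1.5}}]}'
print(check_gallery_config(sample))
```

Empty list → nothing obviously wrong. It does not prove the app accepts the file.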
Programmatic inference w/ LLM Inference API:
# gemma_edge_inference.py
# NOTE: assumes a Python binding for the LLM Inference API. The documented
# MediaPipe LLM Inference targets are Android, iOS, and Web; verify this
# module exists in your MediaPipe build before relying on it.
from mediapipe.tasks.genai import llm_inference

# Configure the LLM
options = llm_inference.LlmInferenceOptions(
    model_path="/data/local/tmp/gemma-4-2b-it-int4.tflite",
    max_tokens=512,
    temperature=0.7,
    top_k=40,
    supported_lora_ranks=[4, 8, 16]  # Optional LoRA support
)

# Create inference engine
engine = llm_inference.LlmInference(options=options)

# Run inference
response = engine.generate_response("Explain edge computing in one sentence.")
print(response)

# Streaming inference
for chunk in engine.generate_response_async("List three benefits of on-device AI."):
    print(chunk, end="", flush=True)
→ App builds+installs, Gemma 4 downloads, coherent responses, GPU delegate active.
If err: SDK ≥ 26 (adb shell getprop ro.build.version.sdk), device storage OK, GPU delegate supported (adb logcat | grep -i delegate), HF access, ADB connection (adb devices).
Step 3: Convert + quantize w/ TFLite
Standard → TFLite w/ post-training quant.
# convert_tflite.py
import os
import tensorflow as tf
import numpy as np

def convert_to_tflite(saved_model_path, output_path, quantization="dynamic"):
    """Convert SavedModel to TFLite with quantization."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)

    if quantization == "dynamic":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    elif quantization == "int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS_INT8
        ]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8

        # Representative dataset for calibration.
        # Placeholder: random tensors only exercise the pipeline. Replace with
        # ~100 real samples matching the model input shape for usable accuracy.
        def representative_dataset():
            for _ in range(100):
                yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]
        converter.representative_dataset = representative_dataset
    elif quantization == "float16":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]
    else:
        raise ValueError(f"Unknown quantization mode: {quantization}")

    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)

    original_size = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)
    quantized_size = len(tflite_model) / (1024 * 1024)
    print(f"Original: {original_size:.1f} MB -> Quantized: {quantized_size:.1f} MB")
    print(f"Compression ratio: {original_size / quantized_size:.1f}x")

# Usage
convert_to_tflite("saved_model/", "model_int8.tflite", quantization="int8")
ONNX Runtime quant alt:
# quantize_onnx.py
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
# Dynamic quantization (no calibration data needed)
quantize_dynamic(
model_input="model.onnx",
model_output="model_int8.onnx",
weight_type=QuantType.QInt8
)
# Static quantization (better accuracy, needs calibration)
# ... (see EXAMPLES.md for complete calibration workflow)
→ TFLite gen'd, size 2-4x smaller w/ INT8, accuracy within 1-2% of float, ONNX quant valid.
If err: TF ≥ 2.15, rep dataset matches input shape, all ops supported (converter.allow_custom_ops = True fallback), ONNX opset compat.
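What INT8 quant does to one tensor. Sketch only, pure Python: standard affine mapping (scale + zero point). The converters above do this per tensor or per channel internally; function names and sample weights here are illustrative assumptions.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Compute scale and zero point so [x_min, x_max] maps onto [qmin, qmax]."""
    # Range must include 0.0 so zero maps exactly to an integer
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped integers."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = quant_params(min(weights), max(weights))
restored = dequantize(quantize(weights, scale, zero_point), scale, zero_point)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f} zero_point={zero_point} worst error={worst:.5f}")
```

Round-trip error stays within half a scale step for in-range values. A wide range (outliers) → big scale → more error, which is why calibration data should look like real inputs.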
Step 4: HW delegates
Select + config.
# configure_delegates.py
import tensorflow as tf

def create_interpreter_with_delegate(model_path, delegate="gpu"):
    """Create TFLite interpreter with hardware delegate."""
    if delegate == "gpu":
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_gpu_delegate.so",
            options={"precision": "fp16", "allow_quantized_models": "true"}
        )
    elif delegate == "nnapi":
        # Android Neural Networks API - routes to NPU/DSP
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_nnapi_delegate.so"
        )
    elif delegate == "xnnpack":
        # Optimized CPU inference
        delegate_obj = None  # XNNPACK is default in TFLite
    else:
        # Avoid a NameError on delegate_obj for unrecognized names
        raise ValueError(f"Unknown delegate: {delegate}")

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_delegates=[delegate_obj] if delegate_obj else None,
        num_threads=4
    )
    interpreter.allocate_tensors()
    return interpreter
Delegate guide:
| Device | Best Delegate | Fallback | Notes |
|---|---|---|---|
| Android (Qualcomm) | NNAPI -> Hexagon DSP | GPU -> XNNPACK | Check nnapi_accelerator_name |
| Android (MediaTek) | NNAPI -> APU | GPU -> XNNPACK | Dimensity chips have dedicated APU |
| Android (Samsung) | NNAPI -> NPU | GPU -> XNNPACK | Exynos NPU via NNAPI |
| iOS | CoreML delegate | Metal GPU | Use coreml_delegate for ANE |
| Linux embedded | GPU (if available) | XNNPACK | RPi uses XNNPACK CPU |
| Browser | WebGL / WebGPU | WASM SIMD | Via TensorFlow.js |
→ Delegate loads, inference on accel, latency 2-10x lower vs CPU-only.
If err: Lib on device, delegate supported (adb shell cat /proc/cpuinfo), fall back XNNPACK, OpenCL for GPU, NNAPI ver.
Step 5: Bench on-device
Latency, mem, power.
# Use TFLite benchmark tool
adb push model_int8.tflite /data/local/tmp/
# CPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--num_threads=4 \
--num_runs=50 \
--warmup_runs=5
# GPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_gpu=true \
--num_runs=50
# NNAPI benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_nnapi=true \
--nnapi_accelerator_name=google-edgetpu \
--num_runs=50
Python bench:
# benchmark_edge.py
import time
import numpy as np
import psutil

def benchmark_inference(interpreter, input_data, num_runs=100):
    """Benchmark TFLite model inference."""
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Warmup
    for _ in range(10):
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()

    # Benchmark
    latencies = []
    mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1000)
    mem_after = psutil.Process().memory_info().rss / (1024 * 1024)

    print(f"Latency (p50): {np.percentile(latencies, 50):.1f} ms")
    print(f"Latency (p95): {np.percentile(latencies, 95):.1f} ms")
    print(f"Latency (p99): {np.percentile(latencies, 99):.1f} ms")
    print(f"Memory delta: {mem_after - mem_before:.1f} MB")
    print(f"Throughput: {1000 / np.mean(latencies):.1f} inferences/sec")
→ Latency percentiles + mem + throughput. GPU 2-5x faster vs CPU. Gemma 4 2B → 10-30 tok/sec on flagship.
If err: Bench binary matches arch (arm64-v8a), model pushed (adb shell ls /data/local/tmp/), storage OK, kill bg apps, thermal throttle check (adb shell cat /sys/class/thermal/thermal_zone*/temp).
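Reading the thermal check output. Sketch only: assumes each zone reports millidegrees Celsius (the Linux sysfs convention; some vendor kernels differ). The 45 °C limit and the name `parse_thermal_zones` are hypothetical.

```python
def parse_thermal_zones(raw_output, limit_c=45.0):
    """Convert thermal zone readings to Celsius; return (all_temps, hot_temps)."""
    temps_c = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line.lstrip("-").isdigit():
            continue  # skip blank lines and shell error text
        temps_c.append(int(line) / 1000.0)  # assumed unit: millidegrees C
    hot = [t for t in temps_c if t >= limit_c]
    return temps_c, hot


# Example text as it might come back from the adb command above
sample_output = "38500\n47200\n\ncat: permission denied\n"
temps, hot = parse_thermal_zones(sample_output)
print(temps, hot)
```

Any hot zone before a run → wait for cooldown, else numbers reflect throttled clocks.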
Step 6: Package → prod
Mobile app w/ embedded/downloadable model.
// Android: EdgeAIManager.kt
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class EdgeAIManager(private val context: Context) {
    private var llmInference: LlmInference? = null

    fun initialize(modelPath: String) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(512)
            .setTemperature(0.7f)
            .setTopK(40)
            .setResultListener { result, done ->
                // Handle streaming tokens
                onTokenReceived(result, done)
            }
            .build()
        llmInference = LlmInference.createFromOptions(context, options)
    }

    // Placeholder hook: forward partial results to the UI layer
    private fun onTokenReceived(partial: String, done: Boolean) {
    }

    fun generateResponse(prompt: String): String {
        return llmInference?.generateResponse(prompt)
            ?: throw IllegalStateException("Model not initialized")
    }

    fun release() {
        llmInference?.close()
        llmInference = null
    }
}
Download + cache:
// ModelDownloader.kt
import android.content.Context
import java.io.File

class ModelDownloader(private val context: Context) {
    private val modelDir = File(context.filesDir, "models")

    suspend fun ensureModel(modelName: String, url: String): File {
        val modelFile = File(modelDir, modelName)
        // Cached copy from an earlier launch: skip the download
        if (modelFile.exists()) return modelFile
        modelDir.mkdirs()
        // Download with progress tracking
        // ... (see EXAMPLES.md for complete implementation)
        return modelFile
    }
}
→ App builds w/ MediaPipe, model loads first launch, latency OK, cached after download, fallback on unsupported.
If err: minSdk ≥ 26, MediaPipe dep ver, model SHA256, storage, ProGuard preserves MediaPipe classes, test multi-device.
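Model SHA256 check, e.g. in the build or release pipeline before publishing a download. Sketch only: `sha256_of_file` and `model_is_intact` are hypothetical helper names; the expected digest must come from wherever the model is published.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GB models never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_is_intact(path, expected_sha256):
    """Compare the file digest with the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Mismatch → delete the file and download again. A truncated download is the usual cause.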
Check
- Model → TFLite/ONNX w/o op errs
- Quant accuracy < 2% degrade
- HW delegate loads + accels
- Latency meets target (< 100ms vision, < 50ms/tok LLM)
- Mem within budget
- AI Edge Gallery runs Gemma 4
- On-device LLM coherent
- App handles download/cache/update
- Graceful degrade on unsupported
- Battery acceptable
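Tok/sec ↔ ms/tok for the LLM latency check. Sketch only: simple arithmetic, with hypothetical function names. The 50 ms default mirrors the target in the list above.

```python
def ms_per_token(tokens_per_sec):
    """Convert decode throughput to per-token latency in milliseconds."""
    return 1000.0 / tokens_per_sec


def meets_llm_target(tokens_per_sec, target_ms_per_token=50.0):
    """True when per-token latency is strictly under the target."""
    return ms_per_token(tokens_per_sec) < target_ms_per_token


for rate in (10, 20, 30):
    print(f"{rate} tok/sec -> {ms_per_token(rate):.1f} ms/tok, ok={meets_llm_target(rate)}")
```

A 50 ms/tok target needs more than 20 tok/sec, so the low end of the 10-30 tok/sec range quoted in Step 5 would miss it.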
Traps
- Unsupported TFLite ops: Custom ops fail → converter.allow_custom_ops = True or replace, check compat list
- Quant accuracy loss: INT4 degrades sensitive → mixed precision, calibrate w/ rep data
- Delegate init fail: GPU crashes old devices → CPU fallback, check compat
- Mem pressure: Model + app > RAM → memory-mapped, unload, batch=1
- Thermal throttle: Sustained inference → overheat → duty cycle, reduce freq, monitor zones
- Download size: Large over cellular → Wi-Fi-only, resumable, progressive
- Version fragmentation: Works some not others → device matrix test, NNAPI ver checks, compat DB
→
- deploy-ml-model-serving: cloud serving (complement to edge)
- monitor-model-drift: quality over time
- register-ml-model: register before edge deploy
- create-dockerfile: containerize conversion pipeline
- create-multistage-dockerfile: multi-stage builds