deploy-edge-ai-model


Deploy machine learning models to edge devices using Google AI Edge Gallery, TensorFlow Lite, ONNX Runtime, and MediaPipe. Covers model quantization (INT8/INT4), on-device inference with Gemma 4 models, Android/iOS deployment via AI Edge Gallery, hardware delegate selection (GPU/NPU/DSP), and performance benchmarking on constrained devices. Use when deploying models to mobile phones, IoT devices, or embedded systems where cloud inference is impractical due to latency, cost, or connectivity constraints.

By pjt222 · Updated 6/5/2026

name: deploy-edge-ai-model
locale: caveman-ultra
source_locale: en
source_commit: 82c77053
translator: "Julius Brussee homage — caveman"
translation_date: "2026-04-19"
description: >
  Deploy machine learning models to edge devices using Google AI Edge Gallery, TensorFlow Lite, ONNX Runtime, and MediaPipe. Covers model quantization (INT8/INT4), on-device inference with Gemma 4 models, Android/iOS deployment via AI Edge Gallery, hardware delegate selection (GPU/NPU/DSP), and performance benchmarking on constrained devices. Use when deploying models to mobile phones, IoT devices, or embedded systems where cloud inference is impractical due to latency, cost, or connectivity constraints.
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob WebFetch
metadata:
  author: Philipp Thoss
  version: "1.0"
  domain: edge-computing
  complexity: advanced
  language: multi
  tags: edge-ai, google-ai-edge, gemma, tflite, onnx, quantization, on-device

Deploy Edge AI Model

See Extended Examples for complete configuration files, quantization scripts, and benchmark templates.

ML → edge devices. Optimized inference, HW accel, on-device mgmt.

Use When

  • LLMs (Gemma 4, Phi, Llama) → mobile via Google AI Edge Gallery
  • Convert → TFLite/ONNX for on-device
  • Quantize → INT8/INT4, less mem + faster
  • Android/iOS apps w/ local AI
  • HW delegate select (GPU, NPU, DSP, Hexagon, CoreML)
  • Bench latency + mem on target
  • MediaPipe tasks → mobile/embedded
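Why quant → less mem: weight storage scales w/ bits per weight. A minimal sketch of that arithmetic, assuming weights dominate the footprint (activations, KV cache, and quantization scales ignored). The function name and the 2B parameter count are illustrative, not from any library.

```python
def estimate_weight_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in MB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / (1024 * 1024)


if __name__ == "__main__":
    params = 2_000_000_000  # hypothetical 2B-parameter model
    for label, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{label}: {estimate_weight_mb(params, bits):,.0f} MB")
```

Real files run somewhat larger than this estimate: quantized formats also store per-tensor or per-channel scales and metadata.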

In

  • Required: Trained model (SavedModel, PyTorch, ONNX, HF checkpoint)
  • Required: Target platform (Android, iOS, Linux embedded, browser)
  • Required: Device constraints (RAM, storage, compute)
  • Optional: Calibration dataset → post-training quant
  • Optional: AI Edge Gallery config → LLM deploy
  • Optional: HW delegate prefs

Do

Step 1: Eval model → edge

Size, latency, device cap.

# assess_model.py
import os
import tensorflow as tf

def assess_model_for_edge(saved_model_path, target_ram_mb=4096):
    """Evaluate whether a model is suitable for edge deployment."""
    model = tf.saved_model.load(saved_model_path)

    # Check model size on disk
    model_size_mb = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)

    print(f"Model size: {model_size_mb:.1f} MB")
    print(f"Target RAM: {target_ram_mb} MB")
    print(f"Size/RAM ratio: {model_size_mb / target_ram_mb:.2%}")

    if model_size_mb > target_ram_mb * 0.25:
        print("WARNING: Model exceeds 25% of device RAM - quantization recommended")
        return False
    return True

Decision matrix:

| Model Size | Device RAM | Recommended Action |
|---|---|---|
| < 50 MB | 2+ GB | Direct TFLite conversion |
| 50-500 MB | 4+ GB | INT8 quantization + TFLite |
| 500 MB-2 GB | 6+ GB | INT4 quantization + AI Edge Gallery |
| 2-4 GB | 8+ GB | Gemma 4 via AI Edge Gallery with INT4 |
| > 4 GB | 12+ GB | Weight streaming or cloud-edge hybrid |
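Matrix → lookup. A rough sketch, assuming each row's upper size bound is exclusive (the matrix leaves boundaries like exactly 500 MB ambiguous) and 1 GB = 1024 MB. `recommend_action` is a hypothetical helper name.

```python
# (exclusive max model size in MB, min device RAM in GB, action),
# rows copied from the decision matrix above.
DECISION_MATRIX = [
    (50, 2, "Direct TFLite conversion"),
    (500, 4, "INT8 quantization + TFLite"),
    (2048, 6, "INT4 quantization + AI Edge Gallery"),
    (4096, 8, "Gemma 4 via AI Edge Gallery with INT4"),
    (float("inf"), 12, "Weight streaming or cloud-edge hybrid"),
]


def recommend_action(model_size_mb, device_ram_gb):
    """Return (action, ram_ok) for the first row whose size bound fits."""
    for max_size_mb, min_ram_gb, action in DECISION_MATRIX:
        if model_size_mb < max_size_mb:
            return action, device_ram_gb >= min_ram_gb
    # Unreachable: the last row has an infinite size bound.
    raise ValueError("no matching row")


if __name__ == "__main__":
    print(recommend_action(300, 4))
```

`ram_ok` False → device below the row's RAM floor: quantize harder or pick a smaller model.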

→ Assessment done, size/RAM ratios, quant recommendation by constraints.

If err: SavedModel path valid (ls saved_model/), TF installed (python -c "import tensorflow"), disk space OK, format supported.
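Path check before loading TF. A small sketch, assuming the usual SavedModel layout (`saved_model.pb` plus a `variables/` directory); some exports may differ, so treat findings as hints. `preflight_saved_model` is a hypothetical name.

```python
import os


def preflight_saved_model(path):
    """Return a list of problems; an empty list means the layout looks normal."""
    if not os.path.isdir(path):
        return [f"not a directory: {path}"]
    problems = []
    if not os.path.isfile(os.path.join(path, "saved_model.pb")):
        problems.append("missing saved_model.pb")
    if not os.path.isdir(os.path.join(path, "variables")):
        problems.append("missing variables/ directory")
    return problems


if __name__ == "__main__":
    for problem in preflight_saved_model("saved_model/"):
        print("WARN:", problem)
```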

Step 2: LLMs via Google AI Edge Gallery

Gemma 4 + LLMs → Android.

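# Clone AI Edge Gallery (official repo: google-ai-edge/gallery)
git clone https://github.com/google-ai-edge/gallery.git
# Android Gradle project sits under Android/src (check repo README if layout changed)
cd gallery/Android/src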

# Build the Android app
./gradlew assembleDebug

# Install on connected device
adb install -r app/build/outputs/apk/debug/app-debug.apk

Gemma 4 config:

{
  "models": [
    {
      "name": "Gemma 4 2B IT",
      "url": "https://huggingface.co/google/gemma-4-2b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_k": 40,
        "top_p": 0.95
      }
    },
    {
      "name": "Gemma 4 4B IT",
      "url": "https://huggingface.co/google/gemma-4-4b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 2048,
        "temperature": 0.7
      }
    }
  ]
}
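Config sanity check before shipping. A minimal sketch that validates only the keys seen in the JSON above; the real Gallery schema may require more or different fields, so the key list is an assumption. `validate_model_config` is a hypothetical helper.

```python
import json

# Keys observed in the example config; not an official schema.
REQUIRED_KEYS = ("name", "url", "format", "backend")


def validate_model_config(text):
    """Parse config JSON and return a list of problems (empty = OK)."""
    config = json.loads(text)
    models = config.get("models")
    if not isinstance(models, list) or not models:
        return ["'models' must be a non-empty list"]
    problems = []
    for i, model in enumerate(models):
        for key in REQUIRED_KEYS:
            if not model.get(key):
                problems.append(f"models[{i}]: missing '{key}'")
        max_tokens = model.get("config", {}).get("max_tokens")
        if max_tokens is not None and (
            not isinstance(max_tokens, int) or max_tokens <= 0
        ):
            problems.append(f"models[{i}]: max_tokens must be a positive integer")
    return problems
```

Usage: `validate_model_config(open("config.json").read())`, where `config.json` is a placeholder filename.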

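Programmatic inference w/ LLM Inference API (sketch only: official bindings target Android/iOS/Web, so verify the Python module + method names below against the installed MediaPipe version):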

# gemma_edge_inference.py
from mediapipe.tasks.genai import llm_inference

# Configure the LLM
options = llm_inference.LlmInferenceOptions(
    model_path="/data/local/tmp/gemma-4-2b-it-int4.tflite",
    max_tokens=512,
    temperature=0.7,
    top_k=40,
    supported_lora_ranks=[4, 8, 16]  # Optional LoRA support
)

# Create inference engine
engine = llm_inference.LlmInference(options=options)

# Run inference
response = engine.generate_response("Explain edge computing in one sentence.")
print(response)

# Streaming inference
for chunk in engine.generate_response_async("List three benefits of on-device AI."):
    print(chunk, end="", flush=True)

→ App builds+installs, Gemma 4 downloads, coherent responses, GPU delegate active.

If err: SDK ≥ 26 (adb shell getprop ro.build.version.sdk), device storage OK, GPU delegate supported (adb logcat | grep -i delegate), HF access, ADB connection (adb devices).

Step 3: Convert + quantize w/ TFLite

Standard → TFLite w/ post-training quant.

# convert_tflite.py
import os
import tensorflow as tf
import numpy as np

def convert_to_tflite(saved_model_path, output_path, quantization="dynamic"):
    """Convert SavedModel to TFLite with quantization."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)

    if quantization == "dynamic":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]

    elif quantization == "int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS_INT8
        ]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8

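        # Representative dataset for calibration.
        # Random data is a placeholder only: swap in ~100 real input samples
        # (same shape + preprocessing as inference) or accuracy will suffer.
        def representative_dataset():
            for _ in range(100):
                yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]
        converter.representative_dataset = representative_dataset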

    elif quantization == "float16":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]

    tflite_model = converter.convert()

    with open(output_path, "wb") as f:
        f.write(tflite_model)

    original_size = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)
    quantized_size = len(tflite_model) / (1024 * 1024)
    print(f"Original: {original_size:.1f} MB -> Quantized: {quantized_size:.1f} MB")
    print(f"Compression ratio: {original_size / quantized_size:.1f}x")

# Usage
convert_to_tflite("saved_model/", "model_int8.tflite", quantization="int8")

ONNX Runtime quant alt:

# quantize_onnx.py
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType

# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8
)

# Static quantization (better accuracy, needs calibration)
# ... (see EXAMPLES.md for complete calibration workflow)

→ TFLite gen'd, size 2-4x smaller w/ INT8, accuracy within 1-2% of float baseline, ONNX quant valid.
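Check the accuracy claim on a held-out set. A minimal sketch, assuming top-1 class predictions from both the float and quantized models are already collected as lists; function names and the 2-point budget default (taken from the checklist) are illustrative.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def accuracy_drop_pct(labels, float_preds, quant_preds):
    """Absolute top-1 accuracy drop, in percentage points."""
    return (accuracy(labels, float_preds) - accuracy(labels, quant_preds)) * 100


def within_budget(labels, float_preds, quant_preds, max_drop_pct=2.0):
    """True when the quantized model loses no more than max_drop_pct points."""
    # Small epsilon absorbs floating-point rounding at the boundary.
    return accuracy_drop_pct(labels, float_preds, quant_preds) <= max_drop_pct + 1e-9
```

Use real eval data, not the calibration set, so the drop is not understated.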

If err: TF ≥ 2.15, rep dataset matches input shape, all ops supported (converter.allow_custom_ops = True fallback), ONNX opset compat.

Step 4: HW delegates

Select + config.

# configure_delegates.py
import tensorflow as tf

def create_interpreter_with_delegate(model_path, delegate="gpu"):
    """Create TFLite interpreter with hardware delegate."""
    delegate_obj = None  # None -> default CPU path (XNNPACK)

    if delegate == "gpu":
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_gpu_delegate.so",
            options={"precision": "fp16", "allow_quantized_models": "true"}
        )
    elif delegate == "nnapi":
        # Android Neural Networks API - routes to NPU/DSP
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_nnapi_delegate.so"
        )
    elif delegate != "xnnpack":
        # Fail loudly instead of hitting a NameError further down
        raise ValueError(f"Unknown delegate: {delegate!r}")

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_delegates=[delegate_obj] if delegate_obj else None,
        num_threads=4
    )
    interpreter.allocate_tensors()
    return interpreter

Delegate guide:

| Device | Best Delegate | Fallback | Notes |
|---|---|---|---|
| Android (Qualcomm) | NNAPI -> Hexagon DSP | GPU -> XNNPACK | Check nnapi_accelerator_name |
| Android (MediaTek) | NNAPI -> APU | GPU -> XNNPACK | Dimensity chips have dedicated APU |
| Android (Samsung) | NNAPI -> NPU | GPU -> XNNPACK | Exynos NPU via NNAPI |
| iOS | CoreML delegate | Metal GPU | Use coreml_delegate for ANE |
| Linux embedded | GPU (if available) | XNNPACK | RPi uses XNNPACK CPU |
| Browser | WebGL / WebGPU | WASM SIMD | Via TensorFlow.js |

→ Delegate loads, inference on accel, latency 2-10x vs CPU-only.

If err: Lib on device, delegate supported (adb shell cat /proc/cpuinfo), fall back XNNPACK, OpenCL for GPU, NNAPI ver.
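Fallback chain from the guide (best → GPU → XNNPACK), sketched w/ an injectable loader so it runs anywhere. `pick_delegate` and `loader` are hypothetical: on a device, `loader` could wrap a real delegate-loading call, assuming it raises on failure.

```python
def pick_delegate(preferred, loader):
    """Try delegates in order; return (name, handle, errors).

    `loader` is any callable taking a delegate name and returning a handle,
    raising ValueError/OSError/RuntimeError when the delegate cannot load.
    Falls back to ("xnnpack", None, errors): no delegate object, CPU path.
    """
    errors = {}
    for name in preferred:
        try:
            return name, loader(name), errors
        except (ValueError, OSError, RuntimeError) as exc:
            errors[name] = str(exc)  # keep the reason for logging
    return "xnnpack", None, errors


if __name__ == "__main__":
    def demo_loader(name):
        # Stand-in loader: pretend only the GPU delegate is available.
        if name != "gpu":
            raise ValueError(f"{name} not available")
        return object()

    print(pick_delegate(["nnapi", "gpu"], demo_loader)[0])
```

Log `errors` per device: feeds the compat matrix from Traps.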

Step 5: Bench on-device

Latency, mem, power.

# Use TFLite benchmark tool
adb push model_int8.tflite /data/local/tmp/

# CPU benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --num_threads=4 \
  --num_runs=50 \
  --warmup_runs=5

# GPU benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --use_gpu=true \
  --num_runs=50

# NNAPI benchmark (accelerator name is device-specific; google-edgetpu is one example)
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --use_nnapi=true \
  --nnapi_accelerator_name=google-edgetpu \
  --num_runs=50

Python bench:

# benchmark_edge.py
import time
import numpy as np
import psutil

def benchmark_inference(interpreter, input_data, num_runs=100):
    """Benchmark TFLite model inference."""
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Warmup
    for _ in range(10):
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()

    # Benchmark
    latencies = []
    mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1000)
    mem_after = psutil.Process().memory_info().rss / (1024 * 1024)

    print(f"Latency (p50): {np.percentile(latencies, 50):.1f} ms")
    print(f"Latency (p95): {np.percentile(latencies, 95):.1f} ms")
    print(f"Latency (p99): {np.percentile(latencies, 99):.1f} ms")
    print(f"Memory delta: {mem_after - mem_before:.1f} MB")
    print(f"Throughput: {1000 / np.mean(latencies):.1f} inferences/sec")

→ Latency percentiles + mem + throughput. GPU 2-5x vs CPU. Gemma 4 2B → 10-30 tok/sec flagship.

If err: Bench binary matches arch (arm64-v8a), model pushed (adb shell ls /data/local/tmp/), storage OK, kill bg apps, thermal throttle check (adb shell cat /sys/class/thermal/thermal_zone*/temp).
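Thermal check → number. A small sketch that parses the output of the `cat .../thermal_zone*/temp` command above, assuming one integer per line in millidegrees Celsius (the common sysfs convention; some vendors report other units). `max_zone_temp_c` is a hypothetical helper.

```python
def max_zone_temp_c(raw):
    """Return the hottest zone in degrees C, or None if nothing parsed."""
    temps = []
    for line in raw.splitlines():
        try:
            value = int(line.strip())
        except ValueError:
            continue  # skip blank lines and permission errors in the output
        temps.append(value / 1000.0)  # millidegrees -> degrees
    return max(temps) if temps else None


if __name__ == "__main__":
    sample = "38500\n41200\ncat: permission denied\n"
    print(max_zone_temp_c(sample))
```

Sample before and after a bench run: big rise → results likely throttled, rerun after cooldown.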

Step 6: Package → prod

Mobile app w/ embedded/downloadable model.

// Android: EdgeAIManager.kt
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class EdgeAIManager(private val context: Context) {
    private var llmInference: LlmInference? = null

    fun initialize(modelPath: String) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(512)
            .setTemperature(0.7f)
            .setTopK(40)
            .setResultListener { result, done ->
                // Handle streaming tokens
                onTokenReceived(result, done)
            }
            .build()

        llmInference = LlmInference.createFromOptions(context, options)
    }

    // Streaming callback stub: replace body with app logic (UI append, etc.)
    private fun onTokenReceived(partial: String, done: Boolean) {
        // `done` is true on the final chunk
    }

    fun generateResponse(prompt: String): String {
        return llmInference?.generateResponse(prompt)
            ?: throw IllegalStateException("Model not initialized")
    }

    fun release() {
        llmInference?.close()
        llmInference = null
    }
}

Download + cache:

// ModelDownloader.kt
import android.content.Context
import java.io.File

class ModelDownloader(private val context: Context) {
    private val modelDir = File(context.filesDir, "models")

    suspend fun ensureModel(modelName: String, url: String): File {
        val modelFile = File(modelDir, modelName)
        if (modelFile.exists()) return modelFile

        modelDir.mkdirs()
        // Download with progress tracking
        // ... (see EXAMPLES.md for complete implementation)
        return modelFile
    }
}

→ App builds w/ MediaPipe, model loads first launch, latency OK, cached after download, fallback on unsupported.

If err: minSdk ≥ 26, MediaPipe dep ver, model SHA256, storage, ProGuard preserves MediaPipe classes, test multi-device.
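Model SHA256 check, sketched in Python for the build or CI side (the same idea applies on device). Assumes the expected digest is published alongside the model; function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GB models never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_sha256):
    """True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Mismatch → delete the cached file and re-download: a truncated download is the usual cause.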

Check

  • Model → TFLite/ONNX w/o op errs
  • Quant accuracy < 2% degrade
  • HW delegate loads + accels
  • Latency meets target (< 100ms vision, < 50ms/tok LLM)
  • Mem within budget
  • AI Edge Gallery runs Gemma 4
  • On-device LLM coherent
  • App handles download/cache/update
  • Graceful degrade on unsupported
  • Battery acceptable
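Latency targets above as arithmetic: tok/sec ↔ ms/tok. A minimal sketch using the checklist thresholds; constant and function names are illustrative.

```python
VISION_TARGET_MS = 100.0         # per-inference target from the checklist
LLM_TARGET_MS_PER_TOKEN = 50.0   # per-token target from the checklist


def ms_per_token(tokens_per_sec):
    """Convert decode throughput to per-token latency."""
    return 1000.0 / tokens_per_sec


def meets_llm_target(tokens_per_sec):
    """True when per-token latency is strictly under the LLM target."""
    return ms_per_token(tokens_per_sec) < LLM_TARGET_MS_PER_TOKEN


def meets_vision_target(latency_ms):
    """True when a single inference is strictly under the vision target."""
    return latency_ms < VISION_TARGET_MS
```

Worked out: 10 tok/sec = 100 ms/tok (miss), 30 tok/sec ≈ 33 ms/tok (pass). The < 50 ms/tok target needs more than 20 tok/sec.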

Traps

  • Unsupported TFLite ops: Custom ops fail → converter.allow_custom_ops = True or replace, check compat list
  • Quant accuracy loss: INT4 degrades sensitive → mixed precision, calibrate w/ rep data
  • Delegate init fail: GPU crashes old devices → CPU fallback, check compat
  • Mem pressure: Model + app > RAM → memory-mapped, unload, batch=1
  • Thermal throttle: Sustained inference → overheat → duty cycle, reduce freq, monitor zones
  • Download size: Large over cellular → Wi-Fi-only, resumable, progressive
  • Version fragmentation: Works some not others → device matrix test, NNAPI ver checks, compat DB
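Resumable download, the core trick: ask the server for bytes after what is already on disk via the standard HTTP `Range` header. A sketch of the header logic only (no network), assuming the server supports range requests; `resume_request_headers` is a hypothetical helper.

```python
import os


def resume_request_headers(partial_path):
    """Headers for an HTTP GET that resumes a partial download.

    Returns {} when nothing is on disk yet (plain full download).
    """
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}
```

Server replies 206 → append to the partial file. Reply 200 → range ignored, restart the file from zero.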

Related

  • deploy-ml-model-serving: cloud serving (complement to edge)
  • monitor-model-drift: quality over time
  • register-ml-model: register before edge deploy
  • create-dockerfile: containerize conversion pipeline
  • create-multistage-dockerfile: multi-stage builds

Install via CLI
npx skills add https://github.com/pjt222/agent-almanac --skill deploy-edge-ai-model