quantized-llama2-7b-mlc


Deploy quantized Llama2-7B with MLC LLM on Jetson Orin NX for fast edge inference. Uses jetson-containers Docker workflow with 4-bit quantization (q4f16_ft). Requires Jetson Orin with ≥16GB RAM, JetPack 5.x, and HuggingFace access token.

By Seeed-Projects. Updated 3/11/2026.


Quantized Llama2-7B with MLC LLM on Jetson


Execution model

Run one phase at a time. After each phase:

  • Relay all command output to the user.
  • If output contains [STOP] → stop immediately, consult the failure decision tree below.
  • If output ends with [OK] → tell the user "Phase N complete" and proceed to the next phase.
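The marker rules above can be sketched as a small shell helper. This is only an illustration, not part of the original skill: the function name `phase_status` is hypothetical, and it assumes the markers appear literally as `[STOP]` and `[OK]` in the captured output.

```shell
#!/bin/sh
# Hypothetical helper: classify captured phase output by its markers.
# Prints STOP if [STOP] appears anywhere, OK if the last non-empty line
# ends with [OK], and UNKNOWN otherwise.
phase_status() {
  # [STOP] anywhere in the output takes priority.
  if printf '%s\n' "$1" | grep -qF '[STOP]'; then
    echo "STOP"
    return 0
  fi
  # Otherwise inspect the last non-empty line, ignoring trailing spaces.
  last_line=$(printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1 | sed 's/[[:space:]]*$//')
  case "$last_line" in
    *'[OK]') echo "OK" ;;
    *)       echo "UNKNOWN" ;;
  esac
}
```

Passing the captured output of a phase as the single argument yields one of the three states, so an UNKNOWN result can be treated as "keep waiting or ask the user" rather than as success.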

Prerequisites

| Requirement | Minimum |
|---|---|
| Hardware | reComputer J4012 (Jetson Orin NX 16GB) or equivalent |
| RAM | ≥ 16 GB |
| JetPack | 5.x (R35.x) |
| Storage | SSD recommended; model weights and Docker images are large |
| Internet | Required for Docker pull and model download |
| HuggingFace | Access token with Llama2 model access granted |

Phase 1 — Preflight

cat /etc/nv_tegra_release
free -h
df -h /

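Expected: L4T release R35.x (JetPack 5), ≥16 GB RAM installed (free -h reports slightly less than the nominal total), and ≥50 GB free on /. [OK] when all three pass. [STOP] if the release is not R35.x or if RAM or disk is insufficient.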
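The three checks can be combined into one script that emits the markers itself. This is a minimal sketch under stated assumptions: the function names are hypothetical, the RAM threshold of 14 GiB is a guess that allows for memory the kernel reserves on a 16 GB module, and the first line of /etc/nv_tegra_release is assumed to start with `# R35 (release)`.

```shell
#!/bin/sh
# Sketch of the Phase 1 checks. Thresholds are assumptions, not values
# from the original skill: 14 GiB RAM and 50 GiB free on /.

# Extract the L4T major release (e.g. 35) from a line such as
# "# R35 (release), REVISION: 4.1, ...".
l4t_major() {
  printf '%s\n' "$1" | sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p'
}

# Convert a MemTotal value in kB (as in /proc/meminfo) to whole GiB.
kb_to_gib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Run all three checks on the device and print an [OK] or [STOP] line.
preflight() {
  release=$(l4t_major "$(head -n 1 /etc/nv_tegra_release 2>/dev/null)")
  mem_gib=$(kb_to_gib "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)")
  disk_gib=$(df -Pk / | awk 'NR==2 {print int($4 / 1024 / 1024)}')

  [ "$release" = "35" ] || { echo "L4T R35.x not detected [STOP]"; return 1; }
  [ "$mem_gib" -ge 14 ] || { echo "RAM ${mem_gib} GiB is too low [STOP]"; return 1; }
  [ "$disk_gib" -ge 50 ] || { echo "Only ${disk_gib} GiB free on / [STOP]"; return 1; }
  echo "Preflight passed [OK]"
}
```

Calling `preflight` on the Jetson prints a single status line; the helper functions take plain strings so they can be checked off-device.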


Phase 2 — Install dependencies and clone jetson-containers

sudo apt-get update
sudo apt-get install -y git python3-pip
git clone --depth=1 https://github.com/dusty-nv/jetson-containers
cd jetson-containers
pip3 install -r requirements.txt

Clone the MLC-LLM helper scripts into jetson-containers/data, which run.sh mounts at /data inside the container:

cd ./data
git clone https://github.com/LJ-Hao/MLC-LLM-on-Jetson-Nano.git
cd ..

[OK] when both repos are cloned and requirements installed. [STOP] if git clone fails.
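A quick way to confirm the [OK] condition for the clones is to look for both .git directories. The sketch below assumes the directory layout produced by the commands above; `check_clones` is a hypothetical name, and it does not verify that the pip requirements were installed.

```shell
#!/bin/sh
# Sketch: confirm both clones from Phase 2 exist before moving on.
# "$1" is the parent directory that contains jetson-containers.
check_clones() {
  base="$1"
  for repo in "$base/jetson-containers" \
              "$base/jetson-containers/data/MLC-LLM-on-Jetson-Nano"; do
    if [ ! -d "$repo/.git" ]; then
      echo "Missing clone: $repo [STOP]"
      return 1
    fi
  done
  echo "Both repositories present [OK]"
}
```

For example, `check_clones "$HOME"` applies when jetson-containers was cloned in the home directory.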


Phase 3 — Pull MLC Docker image and download Llama2 model

Replace <YOUR-ACCESS-TOKEN> with your HuggingFace token:

./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag mlc) \
  /bin/bash -c 'ln -s $(huggingface-downloader meta-llama/Llama-2-7b-chat-hf) /data/models/mlc/dist/models/Llama-2-7b-chat-hf'

Verify the Docker image was pulled:

sudo docker images | grep mlc

[OK] when MLC image is listed and model download completed. [STOP] if image not found or download failed.
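Because a forgotten placeholder only fails after the image has been pulled, it can help to validate the token first. This is a sketch, not part of the original workflow: `valid_hf_token` is a hypothetical helper, and the `hf_` prefix check is an assumption based on the usual format of HuggingFace user access tokens.

```shell
#!/bin/sh
# Sketch: reject an empty or placeholder token before starting Phase 3.
# Returns 0 for a plausible token and 1 otherwise.
valid_hf_token() {
  case "$1" in
    ''|*'<'*|*'>'*) return 1 ;;  # empty, or still the <YOUR-ACCESS-TOKEN> placeholder
    hf_?*)          return 0 ;;  # looks like a user access token (assumed format)
    *)              return 1 ;;
  esac
}

# Usage on the device, before the ./run.sh command:
#   valid_hf_token "$HUGGINGFACE_TOKEN" || { echo "Set a real token [STOP]"; exit 1; }
```

The check only looks at the token's shape; access to the Llama2 repository is still confirmed only by the download itself.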


Phase 4 — Quantize the model with MLC

./run.sh $(./autotag mlc) \
  python3 -m mlc_llm.build \
  --model Llama-2-7b-chat-hf \
  --quantization q4f16_ft \
  --artifact-path /data/models/mlc/dist \
  --max-seq-len 4096 \
  --target cuda \
  --use-cuda-graph \
  --use-flash-attn-mqa

[OK] when quantization completes without errors. [STOP] if OOM or build errors.
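If the build runs out of memory, the failure decision tree suggests reducing --max-seq-len. One way to automate that, sketched here under assumptions, is to retry with a halved value after each failure. The function names and the floor of 1024 are hypothetical, and the loop retries on any failure, not only on OOM.

```shell
#!/bin/sh
# Sketch: retry a build command with a halved --max-seq-len after each
# failure. "$1" is the name of a command or function that takes the
# sequence length as its only argument. The 1024 floor is an assumption.
build_with_fallback() {
  build_cmd="$1"
  seq_len=4096
  while [ "$seq_len" -ge 1024 ]; do
    if "$build_cmd" "$seq_len"; then
      echo "Build succeeded with --max-seq-len $seq_len [OK]"
      return 0
    fi
    echo "Build failed with --max-seq-len $seq_len, retrying smaller"
    seq_len=$(( seq_len / 2 ))
  done
  echo "Build failed at every sequence length [STOP]"
  return 1
}

# Wrapper around the Phase 4 command; run from the jetson-containers directory.
run_mlc_build() {
  ./run.sh $(./autotag mlc) \
    python3 -m mlc_llm.build \
    --model Llama-2-7b-chat-hf \
    --quantization q4f16_ft \
    --artifact-path /data/models/mlc/dist \
    --max-seq-len "$1" \
    --target cuda \
    --use-cuda-graph \
    --use-flash-attn-mqa
}
```

On the device, `build_with_fallback run_mlc_build` tries 4096, then 2048, then 1024. Check the relayed output before accepting a retry, since a non-memory error will fail identically at every length.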


Phase 5 — Run inference

Enter the Docker container (use the image name from Phase 3):

./run.sh <YOUR_MLC_IMAGE_NAME>
# e.g.: ./run.sh dustynv/mlc:51fb0f4-builder-r35.4.1

Inside the container, run the quantized model:

cd /data/MLC-LLM-on-Jetson-Nano
python3 Llama-2-7b-chat-hf-q4f16_ft.py

For comparison, you can also try the non-quantized version (expected to run out of memory on a 16 GB device):

python3 Llama-2-7b-chat-hf.py

[OK] when the quantized model generates text responses successfully.
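A rough calculation shows why the non-quantized run is expected to fail. The sketch below assumes about 7 billion parameters and counts weights only; it ignores quantization scales, the KV cache, and runtime overhead, so real usage is higher. `weights_gb` is a hypothetical helper name.

```shell
#!/bin/sh
# Back-of-the-envelope weight size in GB: parameters x bits per weight / 8.
# Counts weights only; scales, KV cache, and runtime overhead are ignored.
weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 / 1e9 }'
}

weights_gb 7000000000 16   # fp16 weights, prints 14.0
weights_gb 7000000000 4    # 4-bit weights, prints 3.5
```

On a Jetson the CPU and GPU share the same memory, so roughly 14 GB of fp16 weights leaves little of the 16 GB for anything else, while about 3.5 GB of 4-bit weights fits comfortably.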


Failure decision tree

| Symptom | Action |
|---|---|
| git clone fails | Check internet connectivity. Verify git is installed. |
| HuggingFace download fails | Verify token is valid and has Llama2 access. Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to request access. |
| Docker image not found after ./run.sh | Run sudo docker images to check. Re-run the Phase 3 command. |
| OOM during quantization | Close other processes. Ensure ≥16 GB RAM. Try reducing --max-seq-len. |
| Non-quantized model fails to run | Expected on 16 GB: the full model requires more memory. Use the quantized version. |
| ./autotag mlc returns wrong tag | Verify JetPack version matches. The autotag script selects based on L4T version. |
| Slow inference | Ensure --use-cuda-graph and --use-flash-attn-mqa flags were used during quantization. |

Reference files

  • references/source.body.md — full original Seeed tutorial with Docker screenshots, comparison between quantized and non-quantized inference, and video demonstration (reference only)