bitnet - SKILL.md Agent Skill

name: bitnet description: "Microsoft BitNet — 1-bit LLM setup, inference, and benchmarking on CPU. Automates the full workflow: clone bitnet.cpp, create conda env, download GGUF models from HuggingFace, build optimized ternary kernels, and run inference. Supports official Microsoft models (2B) and community models (0.7B-10B). Use when: (1) setting up BitNet/bitnet.cpp for local CPU inference, (2) downloading and running 1-bit/ternary LLMs, (3) benchmarking BitNet vs full-precision models, (4) building edge/agentic inference pipelines without GPU, (5) converting HuggingFace models to GGUF for bitnet.cpp. Triggers on: 'bitnet', '1-bit llm', '1.58-bit', 'ternary model', 'ternary weights', 'edge inference', 'cpu inference', 'bitnet.cpp', 'bitlinear', 'no gpu inference'."

BitNet — 1-Bit LLM Operations

Set up and run Microsoft's BitNet (1.58-bit ternary LLMs) for efficient CPU inference. Models use weights of {-1, 0, +1} — no GPU required.

Quick Start

# Full setup in 5 commands
./scripts/install-bitnet.sh
./scripts/download-model.sh microsoft/BitNet-b1.58-2B-4T-gguf
./scripts/build-bitnet.sh
./scripts/run-inference.sh -p "You are a helpful assistant" -cnv

Or manually:

git clone --recursive https://github.com/microsoft/BitNet.git ~/BitNet
cd ~/BitNet
conda create -n bitnet-cpp python=3.9 -y && conda activate bitnet-cpp
pip install -r requirements.txt
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Hello" -n 128

Prerequisites

Tool	Install	Required
Python 3.9+	`brew install python@3.9` or conda	Yes
CMake 3.22+	`brew install cmake`	Yes
Clang 18+	`brew install llvm`	Yes
conda	`brew install --cask miniconda`	Yes
huggingface-cli	`pip install huggingface-hub`	Yes

Operations

Install BitNet

./scripts/install-bitnet.sh [--dir ~/BitNet]

Clones the BitNet repo, creates bitnet-cpp conda environment, installs Python dependencies.

Download a Model

./scripts/download-model.sh <model-id> [--dir ~/BitNet]

Downloads a GGUF model from HuggingFace into the BitNet models directory.

Recommended models (see references/models.md for full catalog):

Model	Params	Memory	Quality	Best For
`microsoft/BitNet-b1.58-2B-4T-gguf`	2B	0.4 GB	Good	Default, edge agents
`1bitLLM/bitnet_b1_58-3B`	3.3B	0.7 GB	Better	General use
`HF1BitLLM/Llama3-8B-1.58-100B-tokens`	8B	1.6 GB	Best	Quality-focused

Build for Your CPU

./scripts/build-bitnet.sh [--dir ~/BitNet] [--model-dir models/BitNet-b1.58-2B-4T]

Compiles bitnet.cpp with optimized LUT kernels for the local CPU architecture (ARM NEON/DOTPROD or x86 AVX2).

Run Inference

# Chat mode
./scripts/run-inference.sh -p "You are a helpful assistant" -cnv

# Single prompt
./scripts/run-inference.sh -p "Explain ternary quantization" -n 256

# Custom model
./scripts/run-inference.sh --model models/custom/ggml-model-i2_s.gguf -p "Hello"

Key flags: -cnv (chat mode), -n N (max tokens), -t N (threads), -temp F (temperature), -c N (context size, max 4096).

Benchmark

./scripts/benchmark.sh [--dir ~/BitNet]

Runs throughput and latency benchmarks, reports tokens/sec and energy per token.

Convert HuggingFace Models to GGUF

For models in safetensors/BF16 format:

cd ~/BitNet
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir models/bf16
python utils/convert-helper-bitnet.py models/bf16

HuggingFace Transformers (No Speed Benefit)

For prototyping only — no ternary kernel optimization:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/bitnet-b1.58-2B-4T", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/bitnet-b1.58-2B-4T")

messages = [{"role": "user", "content": "What are ternary weights?"}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

Performance Reference

Metric	BitNet 2B	LLaMA 3.2 1B	Qwen2.5 1.5B
Memory	0.4 GB	2.0 GB	2.6 GB
Decode latency	29 ms	48 ms	65 ms
Energy/token	0.028 J	0.258 J	0.347 J
ARC-Challenge	49.91	38.40	46.33
GSM8K	58.38	28.05	55.50

CPU speedups via bitnet.cpp: ARM 1.37-5.07x, x86 2.37-6.17x.

Agentic Use Cases

Dual-model architecture: Use BitNet as the fast local brain (29ms/step) for tool selection, routing, and guard rails. Escalate complex reasoning to a cloud LLM.

Agent swarm: 10 BitNet 2B agents = ~4 GB total RAM. No API costs, no rate limits, works offline.

Edge deployment: Raspberry Pi, air-gapped environments, continuous monitoring agents.

Limitations

4,096 token context limit
Research-stage only (Microsoft's disclaimer)
Fine-tuning requires native ternary training (can't quantize existing models)
GPU support limited to NVIDIA A100
One official model (2B); community models vary in quality