Explore AI Agent Skills & Claude Prompts

LightLLM Qwen3-VL-8B-Instruct: api_server tp 2 on port 8089, then lmms-eval CLI (python -m lmms_eval, model openai_compatible, tasks mmmu_val, batch_size 900) with OPENAI_API_BASE pointing at LightLLM OpenAI-compatible /v1. Restore https_proxy for Hub while no_proxy includes 127.0.0.1. Requires lmms-eval install, OPENAI_API_KEY placeholder, LOG_DIR and MODEL_DIR, nvidia-smi GPU choice, pipefail with tee, summary.txt. No wrapper script; use command line only.
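
The environment handling described above can be sketched in Python as follows. This is a minimal illustration, not the skill's own code: the helper name `build_eval_env` and the placeholder key value `"EMPTY"` are assumptions, and the skill itself asks for plain command-line usage.

```python
import os


def build_eval_env(base_env, host, port, https_proxy=None):
    """Return a copy of base_env prepared for an eval run against a local server."""
    env = dict(base_env)
    # Point the OpenAI-compatible client at the local /v1 endpoint.
    env["OPENAI_API_BASE"] = f"http://{host}:{port}/v1"
    # Placeholder key (assumed value); an existing key is left untouched.
    env.setdefault("OPENAI_API_KEY", "EMPTY")
    # Keep local traffic off the proxy: no_proxy must include 127.0.0.1.
    hosts = [h for h in env.get("no_proxy", "").split(",") if h]
    for h in ("127.0.0.1", host):
        if h not in hosts:
            hosts.append(h)
    env["no_proxy"] = ",".join(hosts)
    env["NO_PROXY"] = env["no_proxy"]
    # Restore the proxy so Hub downloads still work.
    if https_proxy:
        env["https_proxy"] = https_proxy
    return env


if __name__ == "__main__":
    saved_proxy = os.environ.get("https_proxy")
    env = build_eval_env(os.environ, "127.0.0.1", 8089, https_proxy=saved_proxy)
    print(env["OPENAI_API_BASE"], env["no_proxy"])
```

The returned mapping could be passed as `env=` to `subprocess.run` when launching `python -m lmms_eval`.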

test-model-qwen3-vl-8b-vit-sep-mode

LightLLM Qwen3-VL-8B-Instruct visual separation (ViT sep / proxy). Three processes start in order: config_server on 8090; internal Redis on 6000; visual_only with visual_rpyc 8091 and afs_image_embed_dir; normal api_server tp 2 port 8089 with visual_use_proxy_mode. Once HTTP /v1/models responds on the normal server, run lmms_eval mmmu_val (openai_compatible, batch 900, OPENAI_API_BASE http://HOST:8089/v1); restore https_proxy for Hub while no_proxy includes 127.0.0.1. Outputs: lmms_eval_out, console log, and mmmu_acc in the summary. Use pipefail so tee does not mask the exit code.
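
The readiness gate on /v1/models could look roughly like the sketch below. The function name, timeouts, and the injectable `opener` and `sleep` parameters are illustrative assumptions; only the `/v1/models` path and port 8089 come from the description.

```python
import time
import urllib.error
import urllib.request


def wait_for_models(base_url, timeout_s=600.0, interval_s=5.0,
                    request_timeout_s=5.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll GET {base_url}/models until it returns HTTP 200 or the deadline passes."""
    url = base_url.rstrip("/") + "/models"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_models("http://127.0.0.1:8089/v1", timeout_s=10.0)
    print("ready" if ready else "not ready")
```

Because `urllib` honors the proxy environment variables, the no_proxy entry for 127.0.0.1 still matters for this check.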

test-model-deepseekr1-mtp-tp

DeepSeek-R1 MTP-TP test: LightLLM api_server with MTP (EAGLE) draft, tensor parallel only (--tp 8, no --dp, no EP MoE), plus GSM8K lm_eval on localhost. Distinct from the MTP-EP-TPDP skill which uses --tp 8 --dp 8 and EP MoE. Requires a dedicated log directory, summary.txt, tokenizer aligned with MODEL_DIR. Use for TP-only MTP gsm8k accuracy runs.

test-model-deepseekr1-mtp-ep

Runs LightLLM DeepSeek-R1 EP MoE + MTP (EAGLE) server variants and GSM8K lm_eval against localhost. Requires each full run to use a dedicated log directory: persist every api_server process log under that tree (per-variant subdirectories recommended), write the consolidated summary to summary.txt in that same log directory, and keep artifacts separated from other test runs. Use when running DeepSeek-R1 MTP EP accuracy workflows or when the user asks to run these four server configurations one-by-one with logged results.
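
One way to picture the log layout described above is the sketch below. The helper names, the `api_server.log` file name, the tab-separated summary format, and the variant labels are all hypothetical; the description only fixes a dedicated log directory, per-variant subdirectories, and `summary.txt` at its root.

```python
import tempfile
from pathlib import Path


def prepare_log_dir(log_dir, variants):
    """Create a fresh log tree with one subdirectory per server variant."""
    root = Path(log_dir)
    # exist_ok=False: refuse to reuse a directory from another test run.
    root.mkdir(parents=True, exist_ok=False)
    paths = {}
    for name in variants:
        sub = root / name
        sub.mkdir()
        paths[name] = sub / "api_server.log"
    return paths


def write_summary(log_dir, results):
    """Write one 'variant<TAB>result' line per entry to summary.txt."""
    path = Path(log_dir) / "summary.txt"
    lines = [f"{name}\t{value}" for name, value in results.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp()) / "run"
    logs = prepare_log_dir(root, ["variant_a", "variant_b"])
    print(write_summary(root, {name: "pending" for name in logs}))
```

Failing on an existing directory is a deliberate choice here, so that artifacts from separate runs never mix.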

test-model-deepseekr1-base-tp

Runs LightLLM DeepSeek-R1 baseline TP gsm8k: single api_server with --tp 8 and --batch_max_tokens only, no MTP draft, no --dp, no EP MoE (distinct from deepseekr1-mtp-tp which adds MTP). GSM8K lm_eval on localhost port 8089. Requires a dedicated log directory, api_server and eval logs under that tree, summary.txt as consolidated report, tokenizer aligned with MODEL_DIR. Use for baseline R1 tensor-parallel accuracy runs without MTP/EP.

test-model-qwen3-8b-pd-nixl

LightLLM Qwen3-8B PD disaggregation gsm8k: pd_master on 8089, prefill on 8001, decode on 8002, tp 2 each. Assign four GPUs via nvidia-smi, then export PREFILL_CUDA_DEVICES / DECODE_CUDA_DEVICES (no fixed card IDs; no complex shell automation). Set UCX_NET_DEVICES and TLS for RDMA per cluster. lm_eval hits the pd_master URL. Distinguish HOST from PD_MASTER_IP when co-located. Before lm_eval, POST one completion via curl to pd_master as a warmup check. Requires LOG_DIR, MODEL_DIR, proxy cleared, no_proxy, summary.txt. When model_infer and pd_*_trans share a GPU, NVIDIA MPS is needed for best KV copy performance; record MPS on/off in the summary. Run check_nvidia_peermem.sh from this skill directory and record the result in summary.txt. Use for PD separation tests with either the default NIXL transport or NCCL transport.

Updated 15 days ago
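
The GPU assignment for the PD entry above (four cards, tp 2 for prefill and tp 2 for decode) amounts to the split sketched below. The helper name is hypothetical, and the comma-separated value format is an assumption modeled on CUDA_VISIBLE_DEVICES; the card IDs are whatever nvidia-smi shows as free.

```python
def split_pd_devices(gpu_ids, tp=2):
    """Split 2*tp distinct GPU ids into prefill and decode device strings."""
    ids = [str(g) for g in gpu_ids]
    if len(ids) != 2 * tp:
        raise ValueError(f"need exactly {2 * tp} GPUs, got {len(ids)}")
    if len(set(ids)) != len(ids):
        raise ValueError("GPU ids must be distinct")
    return {
        "PREFILL_CUDA_DEVICES": ",".join(ids[:tp]),
        "DECODE_CUDA_DEVICES": ",".join(ids[tp:]),
    }


if __name__ == "__main__":
    # Example ids only; pick idle cards from nvidia-smi.
    for key, value in split_pd_devices([4, 5, 6, 7]).items():
        print(f"export {key}={value}")
```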

test-model-deepseekv32-ep

Runs LightLLM DeepSeek-V3.2 EP MoE gsm8k: api_server with --tp 8 --dp 8 --enable_ep_moe, tool_call_parser deepseekv32, reasoning_parser deepseek-v3, graph_max_batch_size 32, mem_fraction 0.8, LOADWORKER 14, port 8000 aligned with lm_eval base_url. Requires a dedicated log directory, api_server and eval logs, summary.txt consolidated report. lm_eval uses tokenizer_backend=null (server-side tokenization) because local transformers does not recognize model_type deepseek_v32. Distinct from R1 MTP/Base flows. Use for V3.2 EP MoE gsm8k accuracy on LightLLM.

Updated 20 days ago
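
The V3.2 entry above requires the server port to agree with the lm_eval base_url. A small pre-flight check along these lines could catch a mismatch before a long run; the function name and the example URL are illustrative assumptions.

```python
from urllib.parse import urlsplit


def base_url_matches_port(base_url, server_port):
    """Return True if base_url targets server_port (scheme default if omitted)."""
    parts = urlsplit(base_url)
    port = parts.port
    if port is None:
        port = 443 if parts.scheme == "https" else 80
    return port == server_port


if __name__ == "__main__":
    print(base_url_matches_port("http://127.0.0.1:8000/v1/completions", 8000))
```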

test-model-qwen2-5-14b-fp8kv-gsm8k

LightLLM Qwen2.5-14B-Instruct GSM8K with FP8 KV cache quantization: either fp8kv_sph (per-head calibration JSON) or fp8kv_spt (per-tensor calibration JSON). Single api_server tp 2 fixed HTTP port 8089 (not configurable), lm_eval local-completions. Assign GPUs via nvidia-smi then export CUDA_VISIBLE_DEVICES. Before starting api_server, cwd must be LightLLM repo root; pass --kv_quant_calibration_config_path as the repo-relative path from the table row that matches --llm_kv_type (fp8kv_sph with per-head JSON only; fp8kv_spt with per-tensor JSON only; no absolute path, no REPO_ROOT/CALIB_JSON shell concatenation). If default MODEL_DIR path is missing or load fails with path errors, ask the user for the correct MODEL_DIR. LOG_DIR, summary.txt, port listen checks (not health), no_proxy, background server with log redirect. Two variants documented in one skill.
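
The "port listen checks (not health)" requirement can be read as a plain TCP connect test, sketched below. The function name and timeout are assumptions; the sketch deliberately makes no HTTP request.

```python
import socket


def port_is_listening(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(port_is_listening("127.0.0.1", 8089))
```

A successful connect only shows that the server socket is open; it says nothing about whether the model has finished loading.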

test-model-qwen3-5-0-8b-gsm8k-scenarios

LightLLM Qwen3.5-0.8B GSM8K multi-scenario regression: five isolated runs (baseline api_server, prefill cudagraph, linear-attention cache flags, CPU cache plus linear-att, disk cache with LIGHTLLM_DISK_CACHE_PROMPT_LIMIT_LENGTH). Each scenario uses api_server tp 2 port 8089, then lm_eval local-completions gsm8k batch 500. Scenarios 4 and 5 run lm_eval twice for cache warm hit. Per-scenario LOG_DIR, server.log, eval logs, summary.txt. GPUs from nvidia-smi; port listen readiness not health; clear proxies and set no_proxy. Default MODEL_DIR HuggingFace hub snapshot path; default DISK_CACHE_DIR /mtc/test/tmp/ for scenario 5; ask user for paths if missing or not writable.
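
The per-scenario structure above, with a second lm_eval pass for the cache scenarios, can be laid out as a simple plan. The scenario labels, file names, and helper name below are hypothetical; only the counts (five scenarios, two passes for scenarios 4 and 5) come from the description.

```python
from pathlib import Path


def plan_eval_runs(log_root, scenarios, warm_cache):
    """Return (scenario, eval_log_path) pairs; warm-cache scenarios get two passes."""
    plan = []
    for name in scenarios:
        passes = 2 if name in warm_cache else 1
        for i in range(1, passes + 1):
            # Each scenario keeps its logs in its own directory.
            plan.append((name, Path(log_root) / name / f"eval_pass{i}.log"))
    return plan


if __name__ == "__main__":
    names = ["s1", "s2", "s3", "s4", "s5"]  # placeholder labels
    for scenario, log_path in plan_eval_runs("logs", names, {"s4", "s5"}):
        print(scenario, log_path)
```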

test-model-qwen3-5-0-8b-pd-nixl

LightLLM Qwen3.5-0.8B PD disaggregation over NIXL gsm8k: pd_master on 8089, prefill on 8001, decode on 8002. Supports TP1 and TP2 runs by setting TP / PREFILL_CUDA_DEVICES / DECODE_CUDA_DEVICES. Qwen3.5 has linear-attention state transfer; use --pd_kv_page_size 2048 and --pd_kv_page_num 16. lm_eval hits pd_master URL. Requires UCX/RDMA env, nvidia_peermem check, curl warmup before lm_eval, registration wait in pd_master.log, and summary.txt. Includes optional repeated-prompt decode cache probe for linear-att page-boundary behavior.

Updated 15 days ago
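
For the "registration wait in pd_master.log" step in the entry above, a rough sketch is to count matching lines in the log until both the prefill and decode nodes have appeared. The description does not give the log wording, so the pattern is left as a required argument; the helper name is hypothetical.

```python
import re
from pathlib import Path


def count_registrations(log_path, pattern):
    """Count lines in log_path that match pattern; 0 if the file does not exist yet."""
    path = Path(log_path)
    if not path.exists():
        return 0
    regex = re.compile(pattern)
    text = path.read_text(encoding="utf-8", errors="replace")
    return sum(1 for line in text.splitlines() if regex.search(line))


def nodes_registered(log_path, pattern, expected=2):
    """True once at least `expected` registration lines are present."""
    return count_registrations(log_path, pattern) >= expected
```

A caller would poll `nodes_registered` with whatever registration message pd_master actually prints, then proceed to the curl warmup.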

test-model-qwen3-8b-gsm8k-scenarios

LightLLM Qwen3-8B GSM8K multi-scenario regression: seven isolated api_server configs (baseline, vllm-fp8w8a8 quant, tpsp mix, tpsp with dp2 and dp prefill balance, cpu cache, int8kv on top of cpu cache, disk cache with LIGHTLLM_DISK_CACHE_PROMPT_LIMIT_LENGTH). Each scenario then lm_eval gsm8k batch 500. Scenarios 5–7 run lm_eval twice for cache hit. Per-scenario LOG_DIR, server.log, eval logs, summary.txt. Default MODEL_DIR /mtc/models/qwen3-8b; DISK_CACHE_DIR /mtc/test/tmp/ for scenario 7; ask user if paths invalid. Fixed HTTP port 8089 (not configurable). nvidia-smi GPUs, port listen not health, clear proxies and no_proxy.

test-model-common

Common override guidance for all skills/test_model sub-skills. Applies to LightLLM model accuracy/speed tests that use lm_eval or lmms_eval, especially local-completions GSM8K runs.