chat-test

Run LLM inference tests with Qwen or other models. Use when testing model loading, inference, CUDA Graph, or generation quality.

By m96-chan. Updated 12/28/2025

name: chat-test
description: Run LLM inference tests with Qwen or other models. Use when testing model loading, inference, CUDA Graph, or generation quality.

LLM Chat Test

Test LLM inference with PyGPUkit.

Usage

# Basic chat CLI
python examples/chat_cli.py --model /path/to/model

# Chat with thinking mode
python examples/chat_cli_thinking.py --model /path/to/model

# MoE model (e.g. Qwen3-30B-A3B)
python examples/chat_cli_moe.py --model /path/to/model

Test Models

Local test models:

Qwen3-8B: /c/Users/y_har/.cache/huggingface/hub/models--Aratako--Qwen3-8B-ERP-v0.1/
TinyLlama-1.1B: /c/Users/y_har/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/
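
The two paths above point at Hugging Face cache repository folders. In the standard Hugging Face hub cache layout, the model files themselves sit under `snapshots/<revision>/` inside such a folder. Whether the chat CLIs accept the repository folder directly is not stated here, so the following is only a sketch for locating a snapshot folder to pass as `--model`. `resolve_snapshot` is a hypothetical helper name, not part of PyGPUkit.

```python
from pathlib import Path


def resolve_snapshot(repo_dir: str) -> Path:
    """Return a snapshot folder inside a Hugging Face cache repo folder.

    Hypothetical helper (not part of PyGPUkit). Assumes the standard hub
    cache layout: <repo_dir>/snapshots/<revision>/...
    """
    snapshots = Path(repo_dir) / "snapshots"
    if snapshots.is_dir():
        candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    else:
        candidates = []
    if not candidates:
        raise FileNotFoundError(f"no snapshots found under {snapshots}")
    # If several revisions are cached, pick the most recently modified one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be printed and used as the `--model` argument, for example `print(resolve_snapshot("/path/to/models--org--name"))`.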

Instructions

1. Ensure the project is built.
2. Run the appropriate chat CLI.
3. Test generation quality and performance.
4. Report:
   - Model loading success
   - First token latency
   - Tokens per second
   - Any errors or issues
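
The two timing figures in the report can be measured from any token stream. This is a minimal sketch, assuming generation is exposed as an iterable that yields one item per generated token. How the chat CLIs stream tokens is not described here, so `measure_generation` and its `tokens` argument are hypothetical and would need adapting.

```python
import time
from typing import Callable, Dict, Iterable


def measure_generation(
    tokens: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report first token latency and decode speed.

    `tokens` is any iterable yielding one item per generated token
    (an assumption; not a PyGPUkit API).
    """
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    if count == 0:
        raise ValueError("the stream produced no tokens")
    # Decode throughput excludes the first token, whose latency is
    # dominated by prompt processing (prefill).
    decode_time = last_token_at - first_token_at
    decode_tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {
        "tokens": count,
        "first_token_latency_s": first_token_at - start,
        "decode_tokens_per_s": decode_tps,
    }
```

Separating the first token from the rest keeps prefill time from skewing the tokens-per-second figure, which matters when comparing decode runs with and without CUDA Graph.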

CUDA Graph Testing

# Enable CUDA Graph for decode
python examples/chat_cli_moe.py --model /path/to/model --use-cuda-graph

Notes

- Use Hugging Face tokenizers rather than the built-in tokenizer.
- Large models require significant VRAM.
- CUDA Graph provides roughly a 1.2x speedup for decode.
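
The VRAM and speedup notes above can be sanity-checked with simple arithmetic. This sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only, so the result is a lower bound that ignores the KV cache, activations, and any CUDA Graph buffers. Both function names are hypothetical, and the throughput numbers in the usage lines are placeholders rather than measurements.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Lower bound on VRAM needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def speedup(baseline_tps: float, cuda_graph_tps: float) -> float:
    """Ratio of decode throughput with CUDA Graph to throughput without it."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return cuda_graph_tps / baseline_tps


# An 8-billion-parameter model in 16-bit precision: about 14.9 GiB of weights.
print(f"{weight_memory_gib(8e9):.1f} GiB")

# Placeholder throughputs (not measured): 50 tok/s without, 60 tok/s with CUDA Graph.
print(f"{speedup(50.0, 60.0):.2f}x")
```

Comparing the measured ratio against the expected value of about 1.2x gives a quick signal that `--use-cuda-graph` is actually taking effect.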

Install via CLI

npx skills add https://github.com/m96-chan/PyGPUkit --skill chat-test

Repository Details

Stars: 3
Forks: 0
Branch: main
Path: SKILL.md
