moonshot-kimi-api - SKILL.md Agent Skill

name: moonshot-kimi-api description: "Guide for building applications with the Moonshot Kimi API, focusing on the Kimi K2.5 model. Use this skill whenever the user wants to integrate or call the Kimi API, use Moonshot's large language models (kimi-k2.5, kimi-k2, kimi-k2-thinking), implement thinking/instant modes, tool calling / function calling with Kimi, streaming output, JSON mode, partial mode (prefix continuation), web search, multimodal input (images/video), or multi-turn conversations via Kimi API. Also trigger when code imports openai and sets base_url to api.moonshot.ai or api.moonshot.cn, or when the user mentions 'moonshot', 'kimi', 'kimi-k2', or 'kimi-k2.5' in an API context."

Moonshot Kimi API Development Guide

Build applications powered by Moonshot's Kimi large language models. The API is OpenAI-compatible — use the standard openai Python/Node SDK with a different base_url.

Quick Reference

Item	Value
Base URL (Global)	`https://api.moonshot.ai/v1`
Base URL (China)	`https://api.moonshot.cn/v1`
Auth Header	`Authorization: Bearer $MOONSHOT_API_KEY`
Primary Model	`kimi-k2.5` (multimodal, 256K context, 1T MoE / 32B active)
Other Models	`kimi-k2`, `kimi-k2-0905`, `kimi-k2-thinking`
Chat Endpoint	`POST /v1/chat/completions`
Files Endpoint	`POST /v1/files`
Pricing	~$0.60/M input, ~$2.50/M output tokens

Reference Files

Read these for detailed examples when implementing specific features:

File	Contents
`references/tool-calling.md`	Tool/function calling definitions, execution loop, streaming tool calls, multi-step agentic patterns
`references/streaming.md`	Basic streaming, thinking-mode streaming, tool-call streaming, auto-reconnect
`references/advanced-features.md`	JSON mode, JSON schema, partial mode, web search, multimodal (image/video), file API, error handling, migration guide, complete agent example

Setup & Authentication

Get an API key from platform.moonshot.ai.

from openai import OpenAI

client = OpenAI(
    api_key="your-moonshot-api-key",  # or os.environ["MOONSHOT_API_KEY"]
    base_url="https://api.moonshot.ai/v1",
)

import OpenAI from "openai";
const client = new OpenAI({
  apiKey: "your-moonshot-api-key",
  baseURL: "https://api.moonshot.ai/v1",
});

Models

Kimi K2.5 (recommended default): Model ID kimi-k2.5. 1T-parameter MoE (32B active). 256K context. Native multimodal (text + image + video), agentic tool calling, thinking & instant modes, web search. Vision encoder: MoonViT (400M params).

Kimi K2: Model ID kimi-k2 or kimi-k2-0905. 128K/256K context. Text-only, strong tool calling.

Kimi K2 Thinking: Model ID kimi-k2-thinking. 256K context. Dedicated reasoning model with interleaved thinking and multi-step tool calling.

Basic Chat Completion

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
)
print(response.choices[0].message.content)

Key Parameters

Parameter	Type	Description
`model`	string	Model ID (e.g. `"kimi-k2.5"`)
`messages`	array	Conversation messages with `role` and `content`
`temperature`	float	0.0–1.0. Use 0.6 for instant, 1.0 for thinking
`top_p`	float	Nucleus sampling, recommended 0.95
`max_tokens`	int	Max output tokens (e.g. 4096, 8192)
`stream`	bool	Enable streaming (default: false)
`stop`	array	Up to 4 stop sequences
`n`	int	Number of completions to generate
`response_format`	object	`{"type": "json_object"}` or `{"type": "json_schema", ...}`
`tools`	array	Function definitions for tool calling
`tool_choice`	string/object	`"auto"` / `"none"` / `"required"` / specific function
`frequency_penalty`	float	Penalize frequent tokens
`presence_penalty`	float	Penalize repeated tokens

Thinking Mode vs. Instant Mode

Kimi K2.5 supports two modes controlling whether the model shows internal reasoning.

Thinking Mode (Default)

Produces reasoning traces before answering. Best for math, logic, code debugging, research. Response includes reasoning_content (the thinking process) and content (the final answer).

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=1.0,   # recommended for thinking
    top_p=0.95,
    max_tokens=8192,
)
reasoning = response.choices[0].message.reasoning_content  # thinking trace
answer = response.choices[0].message.content                # final answer

Thinking budget tiers: ~8K tokens (routine), ~32K (complex), ~96K (frontier).

Instant Mode (Disable Thinking)

Skips reasoning traces — faster and 60–75% cheaper on tokens. Use for translation, summarization, simple Q&A, creative writing.

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Translate 'hello world' to French."}],
    temperature=0.6,   # recommended for instant
    top_p=0.95,
    max_tokens=4096,
    extra_body={"thinking": {"type": "disabled"}},
)

Alternative syntax for vLLM/SGLang self-hosted deployments:

extra_body={"chat_template_kwargs": {"thinking": False}}

Use Case	Mode	Temperature
Math / Logic / Proofs	Thinking	1.0
Code debugging & analysis	Thinking	1.0
Complex research	Thinking	1.0
Translation / Summarization	Instant	0.6
Simple Q&A / Creative writing	Instant	0.6
Tool-calling agents	Either	0.6–1.0

Tool Calling (Overview)

Kimi K2.5 has strong native tool-calling. Define tools as functions, pass them in requests, and the model decides when and how to invoke them. The model can perform interleaved thinking and multi-step tool calling — reasoning between tool calls, handling 200+ sequential calls in agentic workflows.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "City name"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0.6,
)

When finish_reason == "tool_calls", execute the requested functions and feed results back as role: "tool" messages. See references/tool-calling.md for the complete execution loop, streaming tool calls, and tool_choice options.

JSON Mode (Overview)

Force valid JSON output with response_format. Always include a JSON instruction in your system/user message.

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "system", "content": "Respond in JSON format."},
        {"role": "user", "content": "Extract name, age from: 'John is a 30-year-old engineer.'"},
    ],
    response_format={"type": "json_object"},
    temperature=0.6,
)

For strict structure, use {"type": "json_schema", "json_schema": {...}}. See references/advanced-features.md for JSON schema examples.

Partial Mode (Overview)

Provide a prefix the model must continue from. Set the last message to role: "assistant" with the prefix and add extra_body={"partial": True}.

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "user", "content": "Write a fibonacci function."},
        {"role": "assistant", "content": "def fibonacci(n):\n    "},
    ],
    extra_body={"partial": True},
    temperature=0.6,
)

Useful for code completion, structured output prefixes, constrained generation. See references/advanced-features.md.

Streaming (Overview)

Enable token-by-token output with stream=True. In thinking mode, reasoning_content appears in delta chunks before content chunks.

stream = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Write a short story."}],
    stream=True,
    temperature=0.6,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

See references/streaming.md for thinking-mode streaming, tool-call streaming, and auto-reconnect patterns.

Web Search (Overview)

Enable built-in internet search with the $web_search tool (~$0.005/call). The model autonomously decides when to search.

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Latest quantum computing news?"}],
    tools=[{"type": "builtin_function", "function": {"name": "$web_search"}}],
    tool_choice="auto",
    temperature=0.6,
)

Multimodal Input (Overview)

Kimi K2.5 natively processes images and video. Use image_url or video_url content types in messages.

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    temperature=0.6,
)

Supports URL and base64 for images (JPG/PNG/GIF/WEBP) and video (MP4). Kimi K2.5 excels at converting wireframes/mockups into code. See references/advanced-features.md for video input and base64 examples.

Multi-Turn Conversations

Accumulate messages across turns. The 256K context window handles long conversations, but implement sliding window or summarization for very long sessions.

messages = [{"role": "system", "content": "You are Kimi, a helpful assistant."}]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="kimi-k2.5", messages=messages, temperature=0.6, max_tokens=4096,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

Best Practices

Choose the right mode: Instant for simple tasks (60–75% token savings), thinking only when reasoning quality matters
Temperature: 0.6 instant, 1.0 thinking, top_p 0.95 for both
Streaming for user-facing apps to reduce perceived latency
Retries with exponential backoff for production (see references/advanced-features.md)
Monitor tokens via the usage field in responses
JSON mode for structured extraction instead of parsing free text
Partial mode for code completion and constrained generation

Migration from OpenAI

Minimal changes — swap base_url and model:

# OpenAI → Kimi
client = OpenAI(api_key="moonshot-key", base_url="https://api.moonshot.ai/v1")
response = client.chat.completions.create(model="kimi-k2.5", ...)

Key differences: temperature range 0–1 (not 0–2), reasoning_content in thinking mode, extra_body for instant/partial modes, $web_search built-in tool, native video input. See references/advanced-features.md for full migration notes.