rnow-config - SKILL.md Agent Skill

name: rnow-config description: Configure ReinforceNow training runs with config.yml and train.jsonl. Also covers converting HuggingFace datasets to ReinforceNow format. Triggers on "config.yml", "train.jsonl", "training config", "batch_size", "group_size", "max_turns", "qlora", "HuggingFace", "dataset", "convert dataset".

ReinforceNow Configuration

This guide covers config.yml and train.jsonl setup for RL, SFT, and Distillation training.

Project Structure

my_project/
├── config.yml          # Training configuration (required)
├── train.jsonl         # Training data (required)
├── rewards.py          # Reward functions (required for RL, not needed for SFT/Distillation)
├── tools.py            # Tool definitions (optional, RL only)
└── requirements.txt    # Python dependencies (optional)

config.yml

Minimal RL Config

project_name: "My RL Project"
dataset_type: rl

data:
  train_file: train.jsonl
  batch_size: 4
  group_size: 8

model:
  path: Qwen/Qwen3-8B

trainer:
  num_epochs: 10
  learning_rate: 0.0001

Minimal SFT Config

project_name: "My SFT Project"
dataset_type: sft

data:
  train_file: train.jsonl
  batch_size: 4
  val_split: 0.2

model:
  path: Qwen/Qwen3-8B

trainer:
  num_epochs: 10
  learning_rate: 0.0001

Minimal Distillation Config

On-policy distillation trains a student model to match a teacher model's behavior. The student generates, the teacher grades each token, and KL divergence provides supervision.

project_name: "My Distillation Project"
dataset_type: distill

data:
  train_file: train.jsonl
  batch_size: 8
  group_size: 4

model:
  path: Qwen/Qwen3-8B        # Student model

teacher:
  path: Qwen/Qwen3-32B       # Teacher model (larger)

rollout:
  max_context_window: 8192

trainer:
  num_epochs: 3
  learning_rate: 0.0001

Key points:

No rewards.py needed - teacher provides all supervision via KL penalty
Student generates on its own distribution (on-policy)
Teacher computes log probabilities for each token
KL penalty coefficient is 1.0 (full weight on teacher supervision)

Full RL Config (All Options)

# Project identification (auto-filled by rnow init)
project_id: ""
project_name: "My RL Project"
dataset_type: rl
description: "Training description"

# Data configuration
data:
  train_file: train.jsonl       # Path to training data
  batch_size: 16                # 1-32, prompts per batch
  group_size: 4                 # 1-64, rollouts per prompt (RL only)
  # NOTE: batch_size * group_size <= 2048

# Model configuration
model:
  path: Qwen/Qwen3-8B           # Model name or checkpoint ID
  qlora_rank: 32                # LoRA rank (model-specific max)
  qlora_alpha: 64               # LoRA alpha (default: rank * 2)
  name: "custom-model-name"     # Optional output name
  description: "Model desc"     # Optional description

# RL algorithm (RL only)
algorithm:
  loss_fn: ppo                  # 'ppo' or 'importance_sampling'
  adv_estimator: grpo           # 'grpo', 'gae', or 'reinforce'
  kl_penalty_coef: 0.01         # KL divergence penalty

# Rollout configuration (RL only)
rollout:
  max_turns: 1                  # Max conversation turns
  max_context_window: 2048              # Max tokens per generation
  termination_policy: last_tool # 'last_tool' or 'max_turns'
  reasoning_mode: null          # null, 'disabled', 'low', 'medium', 'high'
  mcp_url: null                 # MCP server URL(s)
  tool_timeout: 60              # Tool execution timeout
  max_context_window: 32768    # Max context window in tokens (tool results auto-truncated)
  include_thinking: false       # Include <think> in history

# Training configuration
trainer:
  num_epochs: 30                # Number of epochs
  learning_rate: 0.0001         # Learning rate
  save_step: 20                 # -1 = end only, N = every N steps

# Run-dependent evals (optional, top-level)
evals:
  - eval_id: your_eval_id       # From rnow eval
    step: 100                   # Run every 100 steps

Configuration Sections

data

Field	Required	Default	Description
`train_file`	Yes	train.jsonl	Path to training data
`batch_size`	Yes	-	Prompts per batch (1-32)
`group_size`	RL only	4	Rollouts per prompt (1-64)
`val_split`	SFT only	0.0	Validation split ratio (0.0-1.0)

Important: batch_size * group_size must be <= 2048 (concurrency limit).

model

Field	Required	Default	Description
`path`	Yes	-	Model name or checkpoint ID
`qlora_rank`	No	32	LoRA rank for efficient finetuning
`qlora_alpha`	No	rank * 2	LoRA alpha scaling

Supported Models

Qwen (Text)

Qwen/Qwen3-8B (max rank: 128)
Qwen/Qwen3-4B-Instruct-2507 (max rank: 128)
Qwen/Qwen3-30B-A3B (max rank: 64)
Qwen/Qwen3-30B-A3B-Instruct-2507 (max rank: 64)
Qwen/Qwen3-32B (max rank: 128)
Qwen/Qwen3-235B-A22B-Instruct-2507 (max rank: 64)

Qwen (Vision)

Qwen/Qwen3-VL-30B-A3B-Instruct
Qwen/Qwen3-VL-235B-A22B-Instruct

Meta Llama (max rank: 128)

meta-llama/Llama-3.3-70B-Instruct
meta-llama/Llama-3.1-70B
meta-llama/Llama-3.1-8B
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.2-3B
meta-llama/Llama-3.2-1B

DeepSeek (max rank: 64)

deepseek-ai/DeepSeek-V3.1
deepseek-ai/DeepSeek-V3.1-Base

OpenAI (max rank: 32)

openai/gpt-oss-120b
openai/gpt-oss-20b

Moonshot

moonshotai/Kimi-K2-Thinking

Multi-Model Training

Train multiple models with the same config:

model:
  path:
    - Qwen/Qwen3-4B-Instruct-2507
    - Qwen/Qwen3-8B
    - Qwen/Qwen3-30B-A3B
  qlora_rank: 32

The CLI submits separate runs for each model.

teacher (Distillation only)

Field	Required	Description
`path`	Yes	Teacher model name (must be a supported model)

The teacher provides supervision via reverse KL divergence. Use a larger/more capable model as teacher:

teacher:
  path: Qwen/Qwen3-32B  # 32B teacher distilling to 8B student

Teacher selection tips:

Use a model from the same family (e.g., Qwen teacher for Qwen student)
Larger teachers generally produce better students
Teacher must be a supported model (see model list above)

algorithm (RL only)

Field	Default	Options
`loss_fn`	ppo	`ppo`, `importance_sampling`
`adv_estimator`	grpo	`grpo`, `gae`, `reinforce`
`kl_penalty_coef`	0.01	KL divergence penalty weight

Recommendations:

Default ppo + grpo works well for most tasks
Lower kl_penalty_coef (0.001) for more exploration
Higher kl_penalty_coef (0.1) for stability

rollout (RL and Distillation)

Field	Default	Description
`max_turns`	1	Max conversation turns
`max_context_window`	2048	Max tokens per generation
`termination_policy`	last_tool	When to end episode
`reasoning_mode`	null	Chain-of-thought mode
`mcp_url`	null	MCP server URL(s)
`tool_timeout`	60	Tool execution timeout
`max_context_window`	32768	Max context window in tokens
`include_thinking`	false	Keep `<think>` in history

Termination Policies

Policy	Behavior
`last_tool`	Episode ends when model responds without tool call
`max_turns`	Episode always runs for exactly max_turns

Reasoning Mode

For models that support chain-of-thought (<think> tags):

Mode	Description
`null`	Auto-enable for supported models
`disabled`	Explicitly disable reasoning
`low`	Light reasoning
`medium`	Moderate reasoning
`high`	Deep reasoning (more tokens)

Important: Reasoning models need higher max_context_window (8192-16384).

MCP Configuration

Connect to external MCP servers for tools:

# Single server
rollout:
  mcp_url: "https://mcp.tavily.com/mcp/?tavilyApiKey=YOUR_KEY"

# Multiple servers
rollout:
  mcp_url:
    - "https://mcp.tavily.com/mcp/?tavilyApiKey=..."
    - "https://mcp.exa.ai/mcp/?apiKey=..."

# In-sandbox MCP (requires docker in train.jsonl)
rollout:
  mcp_url: localhost:8931

trainer

Field	Required	Default	Description
`num_epochs`	Yes	-	Number of training epochs
`learning_rate`	Yes	-	Learning rate
`save_step`	No	-1	-1 = end only, N = every N steps

evals (top-level)

Run-dependent evaluations that trigger automatically during training. These run in separate containers and log pass@k metrics to training graphs.

Setup:

Create a standalone eval first using the UI or API
Note its eval_id from the evals page
Reference it in your config.yml

evals:
  - eval_id: cmla1l13e000004jwxu39jrpy  # From standalone eval
    step: 100                            # Run every 100 steps
    name: "MATH"                         # Display name in graphs (optional)

Field	Required	Default	Description
`eval_id`	Yes	-	Source eval ID (must exist)
`step`	Yes	-	Run eval every N training steps
`name`	No	eval_id[:8]	Display name for metrics in graphs

How it works:

Trainer spawns eval in a separate Modal container at each step interval
Eval reuses source eval's files from S3 (train.jsonl, rewards.py, config)
Does NOT create new Eval records - just logs metrics
pass@k scores appear in "Evaluation" section of training graphs

pass@k configuration: The pass@1, pass@4, pass@8 metrics are configured on the source eval when you create it. Run-dependent evals inherit these settings. Only enabled metrics appear in graphs.

Graph display:

Section: "Evaluation"
Graphs:  "MATH_pass1", "MATH_pass4", "MATH_pass8"
         (or "cmla1l13_pass1" etc. if no name specified)

Multiple evals example:

evals:
  - eval_id: abc123...
    step: 50
    name: "MATH"
  - eval_id: xyz789...
    step: 100
    name: "GSM8K"

train.jsonl Format

For full train.jsonl documentation including message format, sandbox/docker configuration, and examples, see the rnow-train-jsonl skill.

Converting HuggingFace Datasets

This section shows how to convert HuggingFace datasets to train.jsonl format.

SFT Conversion

For SFT, include both user and assistant messages:

from datasets import load_dataset
import json

dataset = load_dataset("your-dataset-name", split="train")

with open("train.jsonl", "w") as f:
    for row in dataset:
        entry = {
            "messages": [
                {"role": "user", "content": row["question"]},
                {"role": "assistant", "content": row["answer"]}
            ]
        }
        f.write(json.dumps(entry) + "\n")

Alpaca-style Dataset

from datasets import load_dataset
import json

dataset = load_dataset("tatsu-lab/alpaca", split="train")

with open("train.jsonl", "w") as f:
    for row in dataset:
        if row.get("input"):
            user_content = f"{row['instruction']}\n\nInput: {row['input']}"
        else:
            user_content = row["instruction"]

        entry = {
            "messages": [
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": row["output"]}
            ]
        }
        f.write(json.dumps(entry) + "\n")

Multi-turn Conversations

# Input: {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
messages = []
for turn in row["conversations"]:
    role = "user" if turn["from"] == "human" else "assistant"
    messages.append({"role": role, "content": turn["value"]})

entry = {"messages": messages}

RL Conversion

For RL, include only the prompt (user message). The model generates responses during training.

from datasets import load_dataset
import json

dataset = load_dataset("your-math-dataset", split="train")

with open("train.jsonl", "w") as f:
    for row in dataset:
        entry = {
            "messages": [
                {"role": "user", "content": row["question"]}
            ],
            "rewards": ["accuracy"],
            "metadata": {
                "expected_answer": row["answer"]
            }
        }
        f.write(json.dumps(entry) + "\n")

GSM8K Math Dataset

from datasets import load_dataset
import json
import re

dataset = load_dataset("gsm8k", "main", split="train")

with open("train.jsonl", "w") as f:
    for row in dataset:
        # Extract final answer (#### followed by number)
        answer_match = re.search(r"####\s*(.+)$", row["answer"])
        final_answer = answer_match.group(1).strip() if answer_match else row["answer"]

        entry = {
            "messages": [{"role": "user", "content": row["question"]}],
            "rewards": ["accuracy"],
            "metadata": {"expected_answer": final_answer}
        }
        f.write(json.dumps(entry) + "\n")

MATH Dataset (Competition Math)

from datasets import load_dataset
import json
import re

dataset = load_dataset("hendrycks/competition_math", split="train")

def extract_boxed(text: str) -> str:
    match = re.search(r"\\boxed\{([^}]+)\}", text)
    return match.group(1) if match else text

with open("train.jsonl", "w") as f:
    for row in dataset:
        answer = extract_boxed(row["solution"])
        # Wrap in $$ for math-verify (skip if already delimited or plain number)
        if not answer.startswith(("$", "\\(")) and not answer.replace(".", "").replace("-", "").isdigit():
            answer = f"$${answer}$$"

        entry = {
            "messages": [{"role": "user", "content": row["problem"]}],
            "rewards": ["accuracy"],
            "metadata": {"expected_answer": answer}
        }
        f.write(json.dumps(entry) + "\n")

Note: For math-verify, expected_answer MUST have math delimiters ( $...$ or $...$). Raw LaTeX like \sqrt{2} won't parse - use $\sqrt{2}$ . Plain numbers like 42 work as-is.

For reward function examples (math-verify, llm_judge), see the rnow-rewards skill.

Common Configurations

Math Reasoning

project_name: "Math Reasoning"
dataset_type: rl

data:
  train_file: train.jsonl
  batch_size: 8
  group_size: 8

model:
  path: Qwen/Qwen3-8B
  qlora_rank: 64

algorithm:
  loss_fn: ppo
  adv_estimator: grpo
  kl_penalty_coef: 0.01

rollout:
  max_turns: 1
  max_context_window: 8192  # High for reasoning
  reasoning_mode: medium

trainer:
  num_epochs: 20
  learning_rate: 0.0001

Agent with Tools

project_name: "Search Agent"
dataset_type: rl

data:
  train_file: train.jsonl
  batch_size: 4
  group_size: 4

model:
  path: Qwen/Qwen3-8B

rollout:
  max_turns: 5
  max_context_window: 2048
  termination_policy: last_tool
  tool_timeout: 30

trainer:
  num_epochs: 15
  learning_rate: 0.0001

Code Execution

project_name: "Code Agent"
dataset_type: rl

data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 4

model:
  path: Qwen/Qwen3-8B

rollout:
  max_turns: 3
  max_context_window: 4096
  tool_timeout: 120  # Longer for code execution

trainer:
  num_epochs: 10
  learning_rate: 0.0001

SFT for Instruction Following

project_name: "Instruction Tuning"
dataset_type: sft

data:
  train_file: train.jsonl
  batch_size: 8
  val_split: 0.1

model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32

trainer:
  num_epochs: 3
  learning_rate: 0.00005

Distillation for Reasoning

project_name: "Distilled Reasoning Model"
dataset_type: distill

data:
  train_file: train.jsonl
  batch_size: 8
  group_size: 4

model:
  path: Qwen/Qwen3-8B          # Student
  qlora_rank: 32

teacher:
  path: Qwen/Qwen3-32B         # Teacher

rollout:
  max_context_window: 8192             # Enough for reasoning

trainer:
  num_epochs: 3
  learning_rate: 0.0001
  save_step: 20

When to use distillation:

Transfer reasoning capabilities from a large model to a smaller one
Create a cost-effective model that approximates a larger model's behavior
On-policy distillation avoids exposure bias (student learns from its own mistakes)

Validation Rules

batch_size * group_size <= 2048
qlora_rank <= model's max rank
Rewards in train.jsonl must exist in rewards.py (RL only)
Tools in train.jsonl must exist in tools.py (RL only)
sandbox=True requires docker field (RL only)
max_tokens must fit in context window
Distillation requires teacher section with valid model path

Testing Configuration

# Validate and test locally
rnow test -n 3 --verbose

# Test specific entries
rnow test --entry 0,1,2

# Override model for testing
rnow test --model gpt-5-nano