inference-scaling-guide - SKILL.md Agent Skill

name: inference-scaling-guide
description: Guides users through inference-time scaling with its_hub, including algorithm selection (Self-Consistency, Best-of-N, Beam Search, Particle Filtering), budget tuning, reward model setup, tool-calling integration, interpreting results, and troubleshooting. Use when the user is working with its_hub, asking about scaling algorithms, debugging scaling issues, or tuning inference quality.

Inference-Time Scaling Guide

its_hub generates multiple LLM responses and selects the best one using voting, scoring, or search. All algorithms share the same interface: ainfer(lm, prompt, budget) (async) or infer(...) (sync).

For API reference and conceptual overviews, consult the docs at https://ai-innovation.team/its_hub and the docs/ directory. This skill covers practical knowledge, decision frameworks, and troubleshooting.

Algorithm Selection

Need	Algorithm	Why
Fast improvement, tool calling	Self-Consistency	Voting is cheap, no reward model needed, excellent for tool-call consensus
Highest quality single response	Best-of-N	Scores every candidate, picks the best — requires a reward model
Step-by-step reasoning	Beam Search	Evaluates partial solutions at each step — requires process reward model + GPU
Complex multi-path reasoning	Particle Filtering	Maintains diverse reasoning paths — requires process reward model + GPU
Long multi-step tasks	Entropic Particle Filtering	Avoids premature convergence on long sequences — requires process reward model + GPU

Decision framework

No GPU for a reward model? → Self-Consistency (no reward model needed)
Have a judge model or API? → Best-of-N with LLM Judge
Have a local GPU + PRM? → Beam Search or Particle Filtering depending on task complexity
Tool-calling task? → Self-Consistency with tool_vote="tool_hierarchical" is the recommended starting point
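
The decision framework above can be written down as a small helper. This is only a sketch: `choose_algorithm` and its parameters are hypothetical names, not part of its_hub, and the order in which the conditions are checked is an assumption (the list above does not rank them).

```python
def choose_algorithm(
    tool_calling: bool = False,
    has_judge: bool = False,
    has_gpu_prm: bool = False,
    complex_multi_path: bool = False,
) -> str:
    """Return a suggested starting algorithm (hypothetical helper, not an its_hub API)."""
    if tool_calling:
        # Recommended starting point for tool-calling tasks.
        return 'Self-Consistency (tool_vote="tool_hierarchical")'
    if has_gpu_prm:
        # A process reward model on a local GPU enables step-level search.
        return "Particle Filtering" if complex_multi_path else "Beam Search"
    if has_judge:
        return "Best-of-N with LLM Judge"
    # No reward model available: voting does not need one.
    return "Self-Consistency"


print(choose_algorithm(has_judge=True))
```

The helper returns a label only; constructing the algorithm object is left to the setup shown in the later sections.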

Budget Tuning

The budget parameter controls how many LLM calls are made per prompt:

Algorithm	Budget meaning	Starting point	Diminishing returns
Self-Consistency	Number of parallel generations	5-8	Beyond 16 for most tasks
Best-of-N	Number of candidates to score	4-8	Beyond 16
Beam Search	Total generations (= beam_width × steps)	16-32	Depends on step count
Particle Filtering	Number of particles	8-16	Beyond 32

Budget vs cost: each budget unit is one LLM generation call, so budget 8 costs about 8x a single call (reward-model or judge scoring is extra). Start low and increase only if quality improves.

Budget vs latency: Self-Consistency and Best-of-N generate all candidates in parallel (latency ≈ a single call, given enough concurrency). Beam Search and Particle Filtering are sequential per step, so latency grows with the number of steps (latency ≈ steps × step time).
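
One way to apply "start low, increase only if quality improves" is a small sweep that stops at the first budget whose gain falls below a threshold. The sketch below is illustrative: `tune_budget`, the `evaluate` callback, and the `min_gain` threshold are hypothetical, and the accuracy numbers are made up.

```python
from typing import Callable, Iterable


def tune_budget(
    evaluate: Callable[[int], float],
    budgets: Iterable[int] = (4, 8, 16, 32),
    min_gain: float = 0.01,
) -> int:
    """Return the last budget that still improved quality by at least min_gain.

    `evaluate` is a hypothetical callback that runs your benchmark at a given
    budget and returns a quality score (for example, accuracy in [0, 1]).
    """
    budgets = list(budgets)
    best_budget = budgets[0]
    best_score = evaluate(best_budget)
    for budget in budgets[1:]:
        score = evaluate(budget)
        if score - best_score < min_gain:
            break  # Diminishing returns: stop paying for more calls.
        best_budget, best_score = budget, score
    return best_budget


# Toy accuracy curve that flattens after budget 16 (made-up numbers).
toy_scores = {4: 0.62, 8: 0.70, 16: 0.73, 32: 0.733}
chosen = tune_budget(lambda b: toy_scores[b])
print(f"budget {chosen} costs about {chosen}x a single call")
```

In practice `evaluate` would wrap a benchmark run, so each probe is itself expensive; a short list of budgets keeps the sweep affordable.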

Reward Models

Outcome Reward Models (ORM)

Score complete responses. Used by Best-of-N.

LLM Judge (easiest setup — uses an LLM to score):

import os

from its_hub import LLMJudge, OpenAICompatibleLanguageModel

judge_lm = OpenAICompatibleLanguageModel(
    endpoint="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="gpt-4o-mini"
)
judge = LLMJudge(lm=judge_lm, fallback_score=5.0)

The judge model can be the same as the generation model, but using a stronger model as judge improves quality.

Process Reward Models (PRM)

Score each reasoning step. Used by Beam Search and Particle Filtering. Requires a local GPU.

from its_hub.core.reward_models.local_vllm_prm import LocalVllmProcessRewardModel

prm = LocalVllmProcessRewardModel(
    model_name="Qwen/Qwen2.5-Math-PRM-7B",
    device="cuda:0",
    aggregation_method="prod"  # or "mean", "min", "max"
)

Aggregation methods:

prod: Product of step scores (strict — one bad step kills the score)
mean: Average of step scores (forgiving)
min: Worst step score (conservative)
max: Best step score (optimistic)

Start with prod for math, mean for general reasoning.
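
To see how the four aggregation methods differ, the following sketch applies each one to the same list of step scores. It is plain Python written for illustration, not the library's implementation; `aggregate` and the example scores are made up.

```python
import math
from statistics import mean

AGGREGATORS = {
    "prod": math.prod,
    "mean": mean,
    "min": min,
    "max": max,
}


def aggregate(step_scores: list[float], method: str = "prod") -> float:
    """Collapse per-step scores into a single trajectory score."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    return AGGREGATORS[method](step_scores)


# Made-up step scores with one weak step in the middle.
scores = [0.9, 0.2, 0.8]
for method in AGGREGATORS:
    print(f"{method}: {aggregate(scores, method):.3f}")
```

With these scores, prod gives 0.144 while mean gives 0.633: the single weak step dominates the product but is diluted by the average.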

Tool-Calling Integration

Self-Consistency supports voting on tool calls, not just text:

sc = SelfConsistency(tool_vote="tool_hierarchical")
result = sc.infer(lm, messages, budget=5, tools=tools, tool_choice="auto")

Tool voting modes:

tool_name: Vote on which tool to call
tool_args: Vote on tool arguments
tool_hierarchical (recommended): First vote on tool name, then on arguments within the winning tool

Use exclude_args=["timestamp", "id"] to exclude non-semantic arguments from voting.
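
The idea behind hierarchical voting can be sketched in plain Python. This is not its_hub's implementation: `hierarchical_tool_vote` and the dictionary shape of a tool call are assumptions made for illustration.

```python
import json
from collections import Counter


def hierarchical_tool_vote(tool_calls, exclude_args=()):
    """Pick the majority tool name, then the majority arguments for that tool."""
    # Stage 1: vote on the tool name.
    winning_name, _ = Counter(call["name"] for call in tool_calls).most_common(1)[0]

    # Stage 2: among calls to the winning tool, vote on arguments,
    # ignoring non-semantic keys such as timestamps or IDs.
    def args_key(call):
        kept = {k: v for k, v in call["arguments"].items() if k not in exclude_args}
        return json.dumps(kept, sort_keys=True)

    candidates = [c for c in tool_calls if c["name"] == winning_name]
    winning_key, _ = Counter(args_key(c) for c in candidates).most_common(1)[0]
    # Return the first original call that matches the winning arguments.
    return next(c for c in candidates if args_key(c) == winning_key)


calls = [
    {"name": "get_weather", "arguments": {"city": "Paris", "id": "a1"}},
    {"name": "get_weather", "arguments": {"city": "Paris", "id": "b2"}},
    {"name": "get_weather", "arguments": {"city": "Lyon", "id": "c3"}},
    {"name": "search", "arguments": {"query": "Paris weather", "id": "d4"}},
]
winner = hierarchical_tool_vote(calls, exclude_args=("id",))
print(winner["name"], winner["arguments"]["city"])
```

Without excluding `id`, all three `get_weather` calls would look different and the argument vote would be a three-way tie, which is why non-semantic arguments are worth filtering out.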

Best-of-N also works with tool calls when using an LLM Judge that understands tool-call quality.

Step Generation

For Beam Search and Particle Filtering, configure how the LLM generates incrementally:

from its_hub import StepGeneration

sg = StepGeneration(
    max_steps=32,          # Maximum reasoning steps
    step_token="\n\n",     # Split on double newlines
    stop_token=r"\boxed",  # Stop when final answer found
)

Tuning:

max_steps: Higher for complex problems. 16-32 is typical for math.
step_token: Use "\n\n" for chain-of-thought, "\n" for more granular steps.
stop_token: Match your expected answer format (\boxed for math, custom for other tasks).
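
To get a feel for how `step_token`, `stop_token`, and `max_steps` interact, the sketch below splits an already finished response the same way. It only illustrates the splitting rule; `split_into_steps` is a hypothetical helper, and the real StepGeneration drives the LLM incrementally rather than post-processing text.

```python
def split_into_steps(
    text: str,
    step_token: str = "\n\n",
    stop_token: str = r"\boxed",
    max_steps: int = 32,
) -> list[str]:
    """Split a response into steps, stopping at the answer marker or the step cap."""
    steps = []
    for chunk in text.split(step_token):
        if not chunk.strip():
            continue  # Skip empty chunks produced by repeated separators.
        steps.append(chunk)
        # Stop once the final answer marker appears or the cap is reached.
        if stop_token in chunk or len(steps) >= max_steps:
            break
    return steps


response = "Let x = 3.\n\nThen 2x = 6.\n\nSo the answer is \\boxed{6}.\n\nExtra text."
for i, step in enumerate(split_into_steps(response), start=1):
    print(i, repr(step))
```

Running this on a few real model outputs is a quick way to check whether "\n\n" or "\n" yields steps of a sensible size before committing GPU time to a PRM run.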

Concurrency Control

All algorithms accept an optional orchestrator for controlling parallelism:

from its_hub import LMOrchestrator

orchestrator = LMOrchestrator(max_concurrency=4)
sc = SelfConsistency(orchestrator=orchestrator)

When to tune:

Rate-limited APIs: Set max_concurrency to stay under the limit
Local vLLM: Higher concurrency (16-32) is fine
Gateway integration: Implement AbstractOrchestrator with your own rate limiting
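
The effect of a concurrency cap can be illustrated with a semaphore from the standard library. This sketch does not show how LMOrchestrator works internally; `run_limited` and `fake_llm_call` are hypothetical stand-ins.

```python
import asyncio


async def run_limited(tasks, max_concurrency: int = 4):
    """Run coroutine factories with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in tasks))


in_flight = 0
peak = 0


async def fake_llm_call(i: int) -> str:
    """Stand-in for a real LLM request; tracks how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"response {i}"


results = asyncio.run(
    run_limited([lambda i=i: fake_llm_call(i) for i in range(16)], max_concurrency=4)
)
print(len(results), "responses, peak concurrency", peak)
```

All 16 calls complete, but no more than 4 overlap at any time, which is the behavior to aim for when staying under an API rate limit.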

Interpreting Results

Self-Consistency

Good sign: Most responses agree (e.g., 6/8 voted for the same answer)
Bad sign: No clear majority. The problem may be ambiguous or the model uncertain; try a higher budget or a better model.
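
If you have the individual answers, agreement can be quantified directly. A minimal sketch, assuming the answers are already extracted as strings; `agreement` is a hypothetical helper, not an its_hub function.

```python
from collections import Counter


def agreement(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of responses that chose it."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


answers = ["42", "42", "41", "42", "42", "40", "42", "42"]
winner, ratio = agreement(answers)
print(winner, f"{ratio:.2f}")
```

Here 6 of 8 responses agree, so the ratio is 0.75. What counts as a "clear majority" depends on the task; any cutoff you pick is a judgment call.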

Best-of-N

Good sign: Top score is significantly higher than average
Bad sign: All scores are similar — the judge can't differentiate. Try a stronger judge or different scoring criteria.
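
A quick way to check whether the judge separates candidates is to compare the top score with the average. The sketch below uses made-up scores, and `score_spread` is a hypothetical helper.

```python
from statistics import mean, pstdev


def score_spread(scores: list[float]) -> dict[str, float]:
    """Summarize how well the judge separates the candidates."""
    return {
        "top": max(scores),
        "mean": mean(scores),
        "margin": max(scores) - mean(scores),  # Top score minus the average.
        "stdev": pstdev(scores),  # Near zero means the judge cannot differentiate.
    }


separated = score_spread([9.0, 6.0, 5.0, 4.0])
flat = score_spread([7.0, 7.0, 7.0, 7.0])
print(separated["margin"], flat["margin"])
```

The first set has a margin of 3.0, while the flat set has a margin of 0.0, the pattern that suggests trying a stronger judge or different scoring criteria.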

Beam Search / Particle Filtering

Good sign: Final beam scores are high and the surviving paths are diverse
Bad sign: All particles collapsed to the same path — try Entropic Particle Filtering for more diversity.
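
Collapse can be spotted by counting distinct final paths. A minimal sketch, assuming each final beam or particle is available as a string; `path_diversity` is a hypothetical helper.

```python
def path_diversity(paths: list[str]) -> float:
    """Fraction of distinct reasoning paths among the final beams or particles."""
    if not paths:
        raise ValueError("paths must not be empty")
    return len(set(paths)) / len(paths)


collapsed = ["step A -> step B"] * 8
diverse = ["A -> B", "A -> C", "D -> B", "A -> B"]
print(path_diversity(collapsed), path_diversity(diverse))
```

Eight identical particles give 0.125 (one distinct path out of eight), while the second set gives 0.75.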

Common Issues

Symptom	Cause	Fix
All responses identical	Temperature too low or budget too low	Increase temperature (0.7-1.0) or budget
Self-Consistency ties	Budget too low for the task	Increase budget to an odd number (5, 7, 9)
Best-of-N picks poor response	Judge model not strong enough	Use a stronger judge model or tune the prompt
Beam Search OOM	PRM too large for GPU	Use a smaller PRM or move it to a different GPU (`device="cuda:1"`)
Particle Filtering slow	Sequential step generation	Reduce `max_steps` or switch to Self-Consistency for speed
Rate limit errors	Too many parallel calls	Set `LMOrchestrator(max_concurrency=N)`
Empty or null results	LM endpoint unreachable or API key invalid	Verify endpoint with a single `lm.agenerate_single()` call

Resource Cleanup

Always close the LM after use:

# Async context manager (recommended)
async with OpenAICompatibleLanguageModel(...) as lm:
    result = await algorithm.ainfer(lm, prompt, budget=5)

# Sync usage: close explicitly
import asyncio

lm = OpenAICompatibleLanguageModel(...)
result = algorithm.infer(lm, prompt, budget=5)
asyncio.run(lm.close())  # close() is awaited via asyncio.run in sync code
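
In the sync pattern, an exception during inference would skip the close call. One way to guard against that is try/finally, sketched here with a stand-in class so the example runs without a real endpoint; `FakeLM` and `run_and_close` are hypothetical, and the only assumption about the real client is that `close()` is awaitable, as in the snippet above.

```python
import asyncio


class FakeLM:
    """Stand-in for a language model client with an async close() method."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


def run_and_close(lm, work):
    """Run synchronous work, then close the client even if the work raises."""
    try:
        return work(lm)
    finally:
        asyncio.run(lm.close())


lm = FakeLM()
try:
    run_and_close(lm, lambda _lm: 1 / 0)  # Simulated failure during inference.
except ZeroDivisionError:
    pass
print(lm.closed)
```

The client ends up closed even though the work raised, which is the same guarantee the async context manager gives.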

Performance Tips

Start with Self-Consistency — cheapest, fastest, no reward model needed
Upgrade to Best-of-N when you have a judge: better quality at similar latency (generation stays parallel, plus a scoring pass)
Use Beam Search for step-by-step math/reasoning — highest quality on those tasks
Try Entropic Particle Filtering if standard PF converges too early
Monitor GPU memory when using local reward models: PRMs are often 7B+ parameters (for example, Qwen/Qwen2.5-Math-PRM-7B)
Benchmark with scripts/benchmark.py on MATH500 or AIME-2024 to compare algorithms for your model

Reference Documentation

Detailed documentation for specific topics lives in the docs/ directory:

docs/algorithms.md — Full code examples for every algorithm (Self-Consistency, Best-of-N, Beam Search, Particle Filtering, Entropic PF), tool-calling integration, step generation config, and reward model setup
docs/orchestration.md — Concurrency control, custom orchestrator implementation for gateway deployments, async/sync usage patterns
docs/benchmarking.md — How to benchmark algorithms on MATH500 and AIME-2024, budget scaling analysis
docs/iaas-service.md — Running the Inference-as-a-Service HTTP server
docs/quick-start.md — Getting started from zero