name: "think-augmented-function-calling-improving" description: | Improve LLM function/tool calling accuracy by injecting explicit "think" reasoning parameters into function schemas before the model generates arguments. Based on the TAFC framework (arXiv:2601.18282). Use this skill when: - "Add reasoning to my tool definitions" - "My agent is passing wrong parameters to functions" - "Improve function calling accuracy for complex APIs" - "Debug why my LLM agent picks incorrect argument values" - "Make my tool calls more reliable with chain-of-thought" - "Augment my OpenAI function schemas with reasoning"
Think-Augmented Function Calling (TAFC)
This skill enables Claude to apply the Think-Augmented Function Calling framework to improve how LLMs generate function call parameters. The core technique injects an optional think string parameter into every tool/function schema so the model must articulate its reasoning before committing to argument values. For complex parameters with interdependencies, TAFC triggers per-parameter reasoning based on a complexity score. This requires zero architectural changes to LLMs and preserves full OpenAI-compatible API compatibility -- you only modify the JSON schemas you pass to the model.
When to Use
- When building an AI agent whose tool calls frequently have incorrect or hallucinated parameter values
- When your function has interdependent parameters (e.g.,
start_datemust precedeend_date,currencymust matchcountry) - When debugging why an LLM agent chose a specific argument value and you need interpretable traces
- When wrapping third-party APIs with complex validation rules and the LLM struggles with constraint satisfaction
- When designing a function-calling pipeline and you want to maximize first-call accuracy without fine-tuning
- When you have multi-parameter functions where the model tends to ignore descriptions or constraints
Key Technique
Function-level think augmentation. TAFC adds a single optional think parameter to every function schema. This parameter has no effect on execution -- the runtime strips it before calling the real function. But it forces the model into a causal ordering: generate reasoning first, then generate the actual arguments conditioned on that reasoning. The probabilistic relationship is P(params, think | context) = P(think | context) * P(params | context, think). By making the model write out its logic, parameter accuracy improves because the model resolves ambiguities and dependencies in the reasoning step rather than guessing during argument generation.
Parameter-level granular reasoning. Not every parameter needs deep reasoning. TAFC scores each parameter's complexity using three normalized factors: (1) dep -- how many other parameters it depends on or affects, (2) type -- how complex the data type is (a nested object scores higher than a string), and (3) constraint -- how strict the validation rules are (enums, regex patterns, ranges). The composite score psi = sigmoid(a1*dep + a2*type + a3*constraint) determines whether a parameter gets its own think_<param> reasoning field. Parameters exceeding a threshold (typically 0.6) get individual reasoning slots. This avoids wasting tokens on trivial parameters while ensuring critical ones receive dedicated attention.
Dynamic description optimization. TAFC iteratively refines the think parameter's description using execution feedback. After observing which calls succeed and fail, a meta-pass rewrites the think parameter's description to better guide reasoning. This is a discrete optimization loop: score current descriptions by downstream accuracy, then prompt an LLM to improve descriptions that correlate with failures. In practice, 3-5 iterations are enough to converge.
Step-by-Step Workflow
Inventory your function schemas. Collect all tool/function definitions your agent uses. For each function, list every parameter with its type, description, constraints, and dependencies on other parameters.
Add the function-level
thinkparameter. Insert an optional string parameter namedthinkinto each function'spropertieswith a description that instructs the model to reason about parameter selection. Place it first in the property order so the model generates it before other arguments.Score each parameter's complexity. For every parameter, compute three factors on a 0-1 scale:
dep: 0 if independent, 0.5 if it shares a soft relationship with another param, 1.0 if another param's valid values depend on ittype: 0 for primitives (string, int, bool), 0.5 for enums or arrays, 1.0 for nested objects or union typesconstraint: 0 for unconstrained, 0.5 for simple validation (min/max, enum), 1.0 for regex patterns, cross-field rules, or format requirements- Composite:
psi = sigmoid(0.4*dep + 0.3*type + 0.3*constraint)
Add parameter-level
think_<name>fields for complex params. For any parameter wherepsi > 0.6, add a companionthink_<name>string property. Its description should name the specific constraints and dependencies the model must reason about.Update the
requiredarray. Keep allthinkandthink_<name>fields optional. Never require them -- this preserves backward compatibility and lets simpler calls skip reasoning.Modify your function execution layer. Before dispatching the actual function call, strip all
think*keys from the arguments dict. The reasoning parameters are for the model only; the function never sees them.Test with representative inputs. Run your agent on 10-20 diverse queries. Inspect the
thinkvalues to verify the model is reasoning about the right constraints. Look for cases where reasoning is correct but the parameter is still wrong (description problem) vs. reasoning is wrong (prompt problem).Iterate on think descriptions. For functions with low accuracy, rewrite the
thinkparameter description to be more specific. Replace generic "reason about parameters" with explicit instructions like "Consider the timezone of the user before selecting the date format."Optionally automate description refinement. Build a feedback loop: collect (input, generated_args, success/failure) tuples, prompt an LLM to revise the think description to prevent observed failure modes, and re-evaluate. Three to five rounds typically suffice.
Deploy and monitor. Log the
thinkvalues in production alongside the function call results. These traces are invaluable for debugging agent behavior without needing to replay the full conversation context.
Concrete Examples
Example 1: Augmenting a hotel booking API
User: "My agent keeps booking hotels with check-out dates before check-in dates. How do I fix this with TAFC?"
Approach:
- Examine the current function schema
- Add function-level and parameter-level think fields
- Strip think params in the execution layer
Original schema:
{
"name": "book_hotel",
"parameters": {
"type": "object",
"properties": {
"hotel_id": { "type": "string", "description": "Hotel identifier" },
"check_in": { "type": "string", "format": "date", "description": "Check-in date (YYYY-MM-DD)" },
"check_out": { "type": "string", "format": "date", "description": "Check-out date (YYYY-MM-DD)" },
"guests": { "type": "integer", "minimum": 1, "maximum": 10 },
"room_type": { "type": "string", "enum": ["standard", "deluxe", "suite"] }
},
"required": ["hotel_id", "check_in", "check_out", "guests"]
}
}
TAFC-augmented schema:
{
"name": "book_hotel",
"parameters": {
"type": "object",
"properties": {
"think": {
"type": "string",
"description": "Before filling parameters, reason about: (1) which dates the user mentioned and their chronological order, (2) whether check_out is after check_in, (3) how many guests were specified or implied, (4) any room preference stated."
},
"hotel_id": { "type": "string", "description": "Hotel identifier" },
"check_in": { "type": "string", "format": "date", "description": "Check-in date (YYYY-MM-DD)" },
"think_check_out": {
"type": "string",
"description": "Verify: this date must be strictly after check_in. If the user said 'three nights from Jan 5', compute Jan 5 + 3 = Jan 8."
},
"check_out": { "type": "string", "format": "date", "description": "Check-out date (YYYY-MM-DD), must be after check_in" },
"guests": { "type": "integer", "minimum": 1, "maximum": 10 },
"room_type": { "type": "string", "enum": ["standard", "deluxe", "suite"] }
},
"required": ["hotel_id", "check_in", "check_out", "guests"]
}
}
Execution layer (Python):
def execute_tool_call(name: str, arguments: dict) -> Any:
# Strip all TAFC reasoning parameters before dispatch
clean_args = {k: v for k, v in arguments.items() if not k.startswith("think")}
return registry[name](**clean_args)
Example 2: Multi-API agent with currency/locale dependencies
User: "Build me a price lookup tool where the currency must match the market region."
Approach:
- Define the function with interdependent region and currency params
- Score complexity -- currency depends on region (dep=1.0), has enum constraint (constraint=0.5), is a primitive (type=0): psi = sigmoid(0.41.0 + 0.30 + 0.3*0.5) = sigmoid(0.55) ~ 0.63 > 0.6, so it gets a think field
- Add targeted reasoning
Output schema:
{
"name": "get_product_price",
"parameters": {
"type": "object",
"properties": {
"think": {
"type": "string",
"description": "Reason about: which market region the user is asking about, and which currency is valid for that region (US->USD, EU->EUR, JP->JPY, UK->GBP)."
},
"product_id": { "type": "string" },
"region": { "type": "string", "enum": ["US", "EU", "JP", "UK"] },
"think_currency": {
"type": "string",
"description": "The currency MUST match the region: US->USD, EU->EUR, JP->JPY, UK->GBP. State which region was selected and derive the currency."
},
"currency": { "type": "string", "enum": ["USD", "EUR", "JPY", "GBP"] }
},
"required": ["product_id", "region", "currency"]
}
}
Example 3: Applying TAFC to an existing OpenAI function-calling codebase
User: "I have 15 tool definitions in my agent. How do I apply TAFC across all of them efficiently?"
Approach:
- Write a schema transformer function
- Apply complexity scoring programmatically
- Inject think parameters automatically
import math
def compute_complexity(param_name: str, param_schema: dict, all_params: dict) -> float:
"""Score a parameter's complexity on [0, 1]."""
# Dependency: check if description references other param names
dep = 0.0
desc = param_schema.get("description", "").lower()
for other in all_params:
if other != param_name and other.lower() in desc:
dep = 1.0
break
# Type complexity
ptype = param_schema.get("type", "string")
if ptype in ("object", "array") or "anyOf" in param_schema or "oneOf" in param_schema:
type_score = 1.0
elif "enum" in param_schema:
type_score = 0.5
else:
type_score = 0.0
# Constraint complexity
constraint = 0.0
if "pattern" in param_schema or "format" in param_schema:
constraint = 1.0
elif any(k in param_schema for k in ("minimum", "maximum", "minLength", "maxLength", "enum")):
constraint = 0.5
raw = 0.4 * dep + 0.3 * type_score + 0.3 * constraint
return 1 / (1 + math.exp(-5 * (raw - 0.3))) # sigmoid centered at 0.3, steepness 5
def augment_schema_with_tafc(tool_def: dict, threshold: float = 0.6) -> dict:
"""Add TAFC think parameters to a function schema."""
import copy
augmented = copy.deepcopy(tool_def)
params = augmented["parameters"]["properties"]
# Add function-level think
param_names = list(params.keys())
think_desc = (
f"Before selecting parameter values, reason step-by-step about what the user "
f"wants and how the parameters ({', '.join(param_names)}) relate to each other."
)
# Insert think first via ordered rebuild
new_props = {"think": {"type": "string", "description": think_desc}}
for name, schema in params.items():
score = compute_complexity(name, schema, params)
if score > threshold:
new_props[f"think_{name}"] = {
"type": "string",
"description": f"Reason about constraints and dependencies for '{name}' before assigning its value."
}
new_props[name] = schema
augmented["parameters"]["properties"] = new_props
return augmented
Usage:
for tool in my_tools:
tool["function"] = augment_schema_with_tafc(tool["function"])
Best Practices
- Do: Place
thinkas the first property in the schema so the model generates reasoning before any parameter values. JSON property order influences generation order in most LLM APIs. - Do: Write think descriptions that name the specific constraints and relationships to reason about, not generic instructions. "Verify the end_date is after start_date" beats "Think carefully."
- Do: Log think values in production -- they are free debugging traces that explain why your agent chose specific parameter values.
- Do: Strip think parameters server-side before execution so downstream functions never see them and existing code remains unchanged.
- Avoid: Requiring the think parameter. Keep it optional so the model can skip it for trivial calls, reducing latency and token cost.
- Avoid: Adding parameter-level think fields to simple params (booleans, unconstrained strings). The overhead is not worth it when
psi < 0.6. - Avoid: Overly long think descriptions. The model reads them on every call. Keep each under 50 words.
Error Handling
| Problem | Cause | Fix |
|---|---|---|
Model ignores think and still gets params wrong |
Think description is too vague | Rewrite with explicit constraint names and cross-references |
Model fills think but contradicts itself in actual params |
Weak causal coupling | Move think earlier in property order; some APIs respect insertion order |
| Token usage spikes after adding TAFC | Too many parameter-level think fields | Raise the complexity threshold from 0.6 to 0.75 |
Function execution fails with unexpected think key |
Execution layer not stripping think params | Add the {k:v for k,v in args.items() if not k.startswith("think")} filter |
| Think reasoning is repetitive across calls | Static description | Use the iterative refinement loop to evolve descriptions based on failure patterns |
Limitations
- Token cost. Every think parameter adds 20-100 tokens per call. For high-throughput, low-complexity tools (e.g., simple CRUD), the overhead may not justify the accuracy gain.
- Model sensitivity. The technique relies on the model respecting property order and descriptions. Models with weak instruction-following may ignore think fields entirely.
- No guarantee of consistency. The model may reason correctly in the think field but still produce an incorrect parameter value. TAFC improves odds, it does not enforce correctness.
- Single-turn only. TAFC augments individual function calls. It does not help with multi-turn planning where the agent must decide which function to call across a sequence.
- Diminishing returns on simple functions. Functions with 1-2 independent, unconstrained parameters see negligible improvement. The technique shines on 4+ parameter functions with interdependencies.
Reference
Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning -- Wei et al., 2026. Look for: Section 3 (TAFC framework definition), Section 3.3 (complexity scoring formula), and Section 4 (ToolBench evaluation showing 2-5% pass rate improvements across model scales).