name: huggingface-api description: Integrates Hugging Face APIs (Inference Client, Inference Endpoints, Transformers Pipeline, Datasets) for serverless and dedicated model inference with Python. license: MIT compatibility: opencode metadata: version: "1.0.0" domain: coding triggers: hugging face, huggingface, transformers, inference api, inference endpoints, pipelines, how do i use hugging face models, hf inference client archetypes:
- tactical
- generation anti_triggers:
- brainstorming
- vague ideation
- code golf
- over-engineering response_profile: verbosity: low directive_strength: high abstraction_level: operational role: implementation scope: implementation output-format: code content-types:
- code
- guidance
- examples
- do-dont related-skills: coding-openai-api, coding-replicate-api, coding-langchain
Hugging Face API Integration
Integrates Hugging Face APIs for model inference using the huggingface_hub InferenceClient, dedicated Inference Endpoints, transformers pipelines, and datasets library. When loaded, this skill makes the model implement Hugging Face model inference with proper token management, task routing, batching, and error handling.
When to Use
Use this skill when:
- Running inference on Hugging Face models via the serverless Inference API
- Deploying and managing dedicated Inference Endpoints for production workloads
- Using
transformerspipelines for local or batch inference with Hugging Face models - Loading and processing datasets with the
datasetslibrary for ML pipelines - Building text classification, generation, embedding, or image analysis pipelines
- Using custom inference handlers for specialized model deployment
- Integrating Hugging Face models as LLM providers in agentic applications
When NOT to Use
- For OpenAI API calls, use
coding-openai-api - For running models on Replicate, use
coding-replicate-api - For building fine-tuned models, refer to the
transformersTrainer documentation
Core Workflow
Choose Inference Mode — Select between three modes: (a) Serverless Inference API —
InferenceClientwith automatic provider routing, best for prototyping; (b) Dedicated Inference Endpoints — deployed on managed infrastructure, best for production with guaranteed performance; (c) Local Transformers Pipeline — runs models locally on your hardware, best for offline or batch processing. Checkpoint: For production, always use dedicated Inference Endpoints — the serverless API has rate limits and no SLA.Initialize the Client — For serverless inference, use
InferenceClient(token=...)orInferenceClient(api_key=...)for OpenAI compatibility. The token defaults to theHF_TOKENenvironment variable. For dedicated endpoints, useget_inference_endpoint()to manage the endpoint lifecycle. Checkpoint: Verify connectivity by callingclient.get_model_status()or a simple text classification query.Route Tasks to the Right Pipeline — Use the
modelparameter to specify a Hugging Face model ID (e.g.,"meta-llama/Meta-Llama-3-8B-Instruct") or a URL to a deployed Inference Endpoint. The client auto-selects the task based on the model. For chat completion, the client appends/v1/chat/completionsto the URL for OpenAI compatibility. Checkpoint: Verify the model supports your task by checking its task tag on the Hub.Handle Batching and Streaming — For local pipelines, use
batch_sizefor parallel inference on GPUs. For theInferenceClient, send requests individually (batching is not supported server-side). For streaming text generation, passstream=Trueto the chat completion method. Checkpoint: For large datasets, usedatasetsIterableDataset with streaming to avoid memory issues.Manage Inference Endpoints — Use the
huggingface_hubAPI to create, update, pause, resume, and delete endpoints. Setmin_replica,max_replica, andscale_to_zero_timeoutfor autoscaling. Access the endpoint client viaendpoint.clientfor inference calls. Checkpoint: After creating an endpoint, verifyendpoint.statusis"running"before sending requests.
Implementation Patterns
Pattern 1: Serverless Inference with InferenceClient
from __future__ import annotations
from huggingface_hub import InferenceClient
# ❌ BAD — no error handling, no token management, hardcoded model
import requests
resp = requests.post(
"https://api-inference.huggingface.co/models/gpt2",
headers={"Authorization": "Bearer hf_..."},
json={"inputs": "Hello"},
)
print(resp.json())
# ✅ GOOD — InferenceClient, env-based token, typed error handling
client = InferenceClient() # reads HF_TOKEN from environment
def classify_text(texts: list[str], model: str | None = None) -> list[dict]:
"""Classify texts using Hugging Face serverless inference.
Args:
texts: List of text strings to classify.
model: Optional model ID. Defaults to a task-appropriate model.
Returns:
List of classification results with label and score.
Raises:
RuntimeError: On inference failures or authentication errors.
"""
try:
kwargs = {}
if model:
kwargs["model"] = model
results = client.text_classification(texts, **kwargs)
return results
except Exception as e:
error_str = str(e)
if "401" in error_str or "authorization" in error_str.lower():
raise ValueError("Invalid Hugging Face token. Set HF_TOKEN.") from e
raise RuntimeError(f"Inference failed: {e}") from e
def chat_completion(
messages: list[dict[str, str]],
model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
max_tokens: int = 512,
) -> str:
"""Chat completion using a Hugging Face model via the InferenceClient.
Uses the OpenAI-compatible /v1/chat/completions endpoint.
Args:
messages: List of {"role": ..., "content": ...} dicts.
model: Hugging Face model ID or endpoint URL.
max_tokens: Maximum generation tokens.
Returns:
Generated response text.
"""
response = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
)
return response.choices[0].message.content or ""
Pattern 2: Dedicated Inference Endpoints
from __future__ import annotations
from huggingface_hub import InferenceClient, get_inference_endpoint
from huggingface_hub import create_inference_endpoint, InferenceEndpointStatus
class ManagedEndpoint:
"""Manage and use a dedicated Inference Endpoint."""
def __init__(self, endpoint_name: str, namespace: str | None = None) -> None:
self.endpoint_name = endpoint_name
self.namespace = namespace
def ensure_running(self) -> InferenceClient:
"""Ensure the endpoint is running and return a client.
Raises:
RuntimeError: If the endpoint cannot be started.
"""
endpoint = get_inference_endpoint(
self.endpoint_name,
namespace=self.namespace,
)
if endpoint.status != InferenceEndpointStatus.RUNNING:
endpoint.resume()
endpoint.wait() # blocks until running
return endpoint.client # returns InferenceClient pointed at this endpoint
def deploy(
self,
model_repo: str,
instance_type: str = "intel-icl",
instance_size: str = "x2",
min_replica: int = 0,
max_replica: int = 1,
) -> InferenceClient:
"""Create and deploy a new Inference Endpoint.
Args:
model_repo: Hugging Face model repo ID (e.g., "meta-llama/Meta-Llama-3-8B").
instance_type: Cloud instance type.
instance_size: Instance size.
min_replica: Minimum replicas (0 enables scale-to-zero).
max_replica: Maximum replicas for autoscaling.
Returns:
InferenceClient for the new endpoint.
"""
endpoint = create_inference_endpoint(
name=self.endpoint_name,
repository=model_repo,
framework="pytorch",
accelerator="gpu",
instance_type=instance_type,
instance_size=instance_size,
min_replica=min_replica,
max_replica=max_replica,
scale_to_zero_timeout=15, # minutes of inactivity before scale to 0
)
endpoint.wait() # blocks until deployment completes
return endpoint.client
Pattern 3: Local Transformers Pipeline
from __future__ import annotations
from transformers import pipeline
class LocalPipeline:
"""Run inference locally using Hugging Face transformers pipelines."""
def __init__(self, task: str, model: str, device: int = -1) -> None:
"""Initialize a local pipeline.
Args:
task: Pipeline task (e.g., "text-classification", "text-generation",
"feature-extraction", "automatic-speech-recognition").
model: Hugging Face model ID.
device: Device ID (-1 for CPU, 0 for first GPU).
"""
self.pipe = pipeline(task=task, model=model, device=device)
def predict(self, inputs: str | list[str], **kwargs) -> list[dict]:
"""Run inference on one or more inputs.
Args:
inputs: Single string or list of strings.
**kwargs: Additional pipeline parameters.
Returns:
List of prediction results.
"""
return self.pipe(inputs, **kwargs)
def predict_batched(
self, inputs: list[str], batch_size: int = 8, **kwargs
) -> list[dict]:
"""Run batched inference for higher throughput on GPU.
Args:
inputs: List of input strings.
batch_size: Number of items per batch.
**kwargs: Additional pipeline parameters.
Returns:
Concatenated list of prediction results.
"""
results: list[dict] = []
for i in range(0, len(inputs), batch_size):
batch = inputs[i : i + batch_size]
batch_results = self.pipe(batch, batch_size=len(batch), **kwargs)
results.extend(batch_results)
return results
Constraints
MUST DO
- Set the
HF_TOKENenvironment variable or passtokentoInferenceClientfor authenticated access - Use dedicated Inference Endpoints for production workloads with guaranteed latency
- Use the OpenAI-compatible
client.chat.completions.create()for LLM chat completion when the model supports it - Use
batch_sizeparameter for local pipeline inference on GPU for throughput optimization - Check endpoint status after creation/update — only send requests when
status == "running" - Handle 401 errors as authentication failures requiring a valid HF token
MUST NOT DO
- Hardcode Hugging Face tokens in source files — use
HF_TOKENenvironment variable - Use the serverless Inference API for production workloads — it has rate limits and no latency guarantees
- Skip checking the endpoint status before making inference calls against dedicated endpoints
- Use
requests.post()directly to the Inference API — useInferenceClientinstead
Live References
| Resource | URL |
|---|---|
| Hugging Face Hub Python Library | https://pypi.org/project/huggingface-hub/ |
| InferenceClient Reference | https://huggingface.co/docs/huggingface_hub/package_reference/inference_client |
| Inference Endpoints API | https://huggingface.co/docs/huggingface_hub/package_reference/inference_endpoints |
| Transformers Pipeline Tutorial | https://huggingface.co/docs/transformers/main/pipeline_tutorial |
| Datasets Library | https://huggingface.co/docs/datasets/main/use_with_pytorch |
| Inference Toolkit (custom handlers) | https://huggingface.co/docs/inference-endpoints/en/engines/toolkit |
| Hugging Face Inference Endpoints Docs | https://huggingface.co/docs/inference-endpoints/en/faq |
Related Skills
| Skill | Purpose |
|---|---|
coding-openai-api |
OpenAI API for alternative inference provider |
coding-replicate-api |
Replicate for cloud-hosted open-source model inference |
coding-langchain |
Cross-provider orchestration including Hugging Face integrations |