tokenizers-0-23-1 - SKILL.md Agent Skill

name: tokenizers-0-23-1 description: Fast state-of-the-art tokenizers library for NLP written in Rust with Python, Node.js, and Ruby bindings. Use when training custom vocabularies, implementing BPE/WordPiece/Unigram tokenization, building NLP pipelines, or working with transformer models requiring efficient text preprocessing with alignment tracking.

Tokenizers 0.23.1

Overview

🤗 Tokenizers is a fast, state-of-the-art tokenization library written in Rust with bindings for Python, Node.js, and Ruby. It provides implementations of today's most used tokenizers (BPE, WordPiece, Unigram, WordLevel) with a focus on performance and versatility. The same tokenizers power the 🤗 Transformers library.

Key capabilities:

Train new vocabularies from text corpora using BPE, WordPiece, Unigram, or WordLevel algorithms
Extremely fast — Rust implementation tokenizes a GB of text in under 20 seconds on a server CPU
Full alignment tracking — even with destructive normalization, always map tokens back to the original input text
Complete preprocessing pipeline — truncation, padding, special token insertion, attention masks
Designed for both research and production

Version 0.23.1 Highlights

This is the first proper stable release in the 0.23 line (no functional 0.23.0 was published). Major changes from 0.22.x:

Drop Python 3.9 — requires Python 3.10+
Python 3.14 support including free-threaded (3.14t) wheels
Full type hints for every Python class via auto-generated .pyi stubs
Node.js multi-platform bindings restored (13 platforms: macOS x64/arm64/universal, Windows x64/i686/arm64, Linux x64/arm64/armv7 glibc+musl, Android arm64/armv7)
Massive performance gains — added-vocabulary deserialize up to 96% faster, BPE encode batch 16% faster
Unigram sampling — alpha and nbest_size parameters for subword regularization
Weakref support on Tokenizer for long-lived caches
Truncation early exit — right-direction truncation no longer pre-tokenizes past max_length

When to Use

Training a custom vocabulary from a text corpus (wikitext, domain-specific data, etc.)
Implementing BPE, WordPiece, Unigram, or WordLevel tokenization
Building NLP preprocessing pipelines with normalization, pre-tokenization, and post-processing
Working with transformer models that require efficient text preprocessing
Needing alignment tracking between tokens and original text spans (offsets)
Loading pretrained tokenizers from the Hugging Face Hub
Converting legacy vocabulary files into modern tokenizer format

Core Concepts

The Tokenization Pipeline

Every Tokenizer processes input text through a four-stage pipeline:

Normalization — Clean and standardize raw text (Unicode normalization, lowercasing, accent stripping). Alignment is tracked so tokens can always be mapped back to original text.
Pre-tokenization — Split normalized text into sub-units (words, bytes, characters) that define upper bounds for final tokens.
Model — The core algorithm (BPE, WordPiece, Unigram, or WordLevel) that splits pre-tokens into vocabulary tokens and maps them to integer IDs.
Post-processing — Add special tokens ([CLS], [SEP], etc.), set type IDs for sequence pairs.

Models

The model is the only mandatory component. It defines the tokenization algorithm:

BPE (Byte-Pair Encoding) — Starts from characters, iteratively merges most frequent pairs. Used by GPT-2, RoBERTa.
WordPiece — Greedy longest-match algorithm using ## prefix for continuing subwords. Used by BERT.
Unigram — Probabilistic model that selects the best tokenization to maximize sentence probability. Now supports subword regularization via alpha and nbest_size. Used by mBART, XLM-R.
WordLevel — Simple word-to-ID mapping. Requires very large vocabularies for good coverage.

Alignment Tracking

A distinguishing feature of 🤗 Tokenizers is full alignment tracking. Even after destructive normalization (lowercasing, accent removal), every token in the output can be mapped back to its exact span in the original input text via the offsets attribute of an Encoding.

Encoding Object

The output of tokenization is an Encoding object containing:

tokens — List of token strings
ids — List of integer IDs
type_ids — Sequence type IDs (0 for first sequence, 1 for second in pairs)
attention_mask — Binary mask indicating real tokens vs padding
offsets — Character-level (start, end) tuples mapping each token back to the original text

Installation / Setup

Install from PyPI:

pip install tokenizers

Requires Python 3.10+. Prebuilt wheels are available for Linux, macOS, and Windows. Python 3.14 (regular and free-threaded 3.14t) supported.

Install from source (requires Rust toolchain):

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e .

Usage Examples

Quickstart: Train a BPE Tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Initialize with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Set pre-tokenizer to split on whitespace
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train on files
files = ["data/wiki.train.raw", "data/wiki.valid.raw", "data/wiki.test.raw"]
tokenizer.train(files, trainer)

# Save and reload
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

Encode Text

output = tokenizer.encode("Hello, y'all! How are you?")

print(output.tokens)   # ["Hello", ",", "y", "'", "all", ...]
print(output.ids)       # [27253, 16, 93, 11, 5097, ...]
print(output.offsets)   # [(0, 5), (5, 6), (6, 7), ...]

Alignment Tracking

Map a token back to its original text span:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
# Find what caused the [UNK] token (index 9)
start, end = output.offsets[9]
print(sentence[start:end])  # "😁"

Post-processing with Special Tokens

Add [CLS] and [SEP] tokens for BERT-style inputs:

from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Single sequence
output = tokenizer.encode("Hello there")
print(output.tokens)  # ["[CLS]", "Hello", "there", "[SEP]"]

# Sequence pair with type IDs
output = tokenizer.encode("Hello there", "How are you?")
print(output.type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]

Batch Encoding with Padding

# Enable automatic padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

outputs = tokenizer.encode_batch([
    "Short sentence",
    "This is a much longer sentence that needs padding"
])

print(outputs[0].attention_mask)  # [1, 1, 1, 1, 0, ...]  (padded positions = 0)

Load Pretrained from Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Advanced Topics

Tokenization Pipeline: Normalization, pre-tokenization, model, and post-processing in detail with the BERT-from-scratch example → Tokenization Pipeline

Models Reference: BPE, WordPiece, Unigram, and WordLevel — parameters, behavior, and training algorithms → Models Reference

Components Reference: Normalizers, pre-tokenizers, post-processors, decoders, trainers, and added tokens → Components Reference

API Reference: Tokenizer class methods, Encoding object attributes, input types, padding/truncation, batch operations, and the visualizer tool → API Reference