name: tokenizers-0-23-1 description: Fast state-of-the-art tokenizers library for NLP written in Rust with Python, Node.js, and Ruby bindings. Use when training custom vocabularies, implementing BPE/WordPiece/Unigram tokenization, building NLP pipelines, or working with transformer models requiring efficient text preprocessing with alignment tracking.
Tokenizers 0.23.1
Overview
๐ค Tokenizers is a fast, state-of-the-art tokenization library written in Rust with bindings for Python, Node.js, and Ruby. It provides implementations of today's most used tokenizers (BPE, WordPiece, Unigram, WordLevel) with a focus on performance and versatility. The same tokenizers power the ๐ค Transformers library.
Key capabilities:
- Train new vocabularies from text corpora using BPE, WordPiece, Unigram, or WordLevel algorithms
- Extremely fast โ Rust implementation tokenizes a GB of text in under 20 seconds on a server CPU
- Full alignment tracking โ even with destructive normalization, always map tokens back to the original input text
- Complete preprocessing pipeline โ truncation, padding, special token insertion, attention masks
- Designed for both research and production
Version 0.23.1 Highlights
This is the first proper stable release in the 0.23 line (no functional 0.23.0 was published). Major changes from 0.22.x:
- Drop Python 3.9 โ requires Python 3.10+
- Python 3.14 support including free-threaded (
3.14t) wheels - Full type hints for every Python class via auto-generated
.pyistubs - Node.js multi-platform bindings restored (13 platforms: macOS x64/arm64/universal, Windows x64/i686/arm64, Linux x64/arm64/armv7 glibc+musl, Android arm64/armv7)
- Massive performance gains โ added-vocabulary deserialize up to 96% faster, BPE encode batch 16% faster
- Unigram sampling โ
alphaandnbest_sizeparameters for subword regularization - Weakref support on
Tokenizerfor long-lived caches - Truncation early exit โ right-direction truncation no longer pre-tokenizes past
max_length
When to Use
- Training a custom vocabulary from a text corpus (wikitext, domain-specific data, etc.)
- Implementing BPE, WordPiece, Unigram, or WordLevel tokenization
- Building NLP preprocessing pipelines with normalization, pre-tokenization, and post-processing
- Working with transformer models that require efficient text preprocessing
- Needing alignment tracking between tokens and original text spans (offsets)
- Loading pretrained tokenizers from the Hugging Face Hub
- Converting legacy vocabulary files into modern tokenizer format
Core Concepts
The Tokenization Pipeline
Every Tokenizer processes input text through a four-stage pipeline:
- Normalization โ Clean and standardize raw text (Unicode normalization, lowercasing, accent stripping). Alignment is tracked so tokens can always be mapped back to original text.
- Pre-tokenization โ Split normalized text into sub-units (words, bytes, characters) that define upper bounds for final tokens.
- Model โ The core algorithm (BPE, WordPiece, Unigram, or WordLevel) that splits pre-tokens into vocabulary tokens and maps them to integer IDs.
- Post-processing โ Add special tokens (
[CLS],[SEP], etc.), set type IDs for sequence pairs.
Models
The model is the only mandatory component. It defines the tokenization algorithm:
- BPE (Byte-Pair Encoding) โ Starts from characters, iteratively merges most frequent pairs. Used by GPT-2, RoBERTa.
- WordPiece โ Greedy longest-match algorithm using
##prefix for continuing subwords. Used by BERT. - Unigram โ Probabilistic model that selects the best tokenization to maximize sentence probability. Now supports subword regularization via
alphaandnbest_size. Used by mBART, XLM-R. - WordLevel โ Simple word-to-ID mapping. Requires very large vocabularies for good coverage.
Alignment Tracking
A distinguishing feature of ๐ค Tokenizers is full alignment tracking. Even after destructive normalization (lowercasing, accent removal), every token in the output can be mapped back to its exact span in the original input text via the offsets attribute of an Encoding.
Encoding Object
The output of tokenization is an Encoding object containing:
tokensโ List of token stringsidsโ List of integer IDstype_idsโ Sequence type IDs (0 for first sequence, 1 for second in pairs)attention_maskโ Binary mask indicating real tokens vs paddingoffsetsโ Character-level(start, end)tuples mapping each token back to the original text
Installation / Setup
Install from PyPI:
pip install tokenizers
Requires Python 3.10+. Prebuilt wheels are available for Linux, macOS, and Windows. Python 3.14 (regular and free-threaded 3.14t) supported.
Install from source (requires Rust toolchain):
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e .
Usage Examples
Quickstart: Train a BPE Tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
# Initialize with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Set pre-tokenizer to split on whitespace
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
vocab_size=30000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
# Train on files
files = ["data/wiki.train.raw", "data/wiki.valid.raw", "data/wiki.test.raw"]
tokenizer.train(files, trainer)
# Save and reload
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")
Encode Text
output = tokenizer.encode("Hello, y'all! How are you?")
print(output.tokens) # ["Hello", ",", "y", "'", "all", ...]
print(output.ids) # [27253, 16, 93, 11, 5097, ...]
print(output.offsets) # [(0, 5), (5, 6), (6, 7), ...]
Alignment Tracking
Map a token back to its original text span:
output = tokenizer.encode("Hello, y'all! How are you ๐ ?")
# Find what caused the [UNK] token (index 9)
start, end = output.offsets[9]
print(sentence[start:end]) # "๐"
Post-processing with Special Tokens
Add [CLS] and [SEP] tokens for BERT-style inputs:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
# Single sequence
output = tokenizer.encode("Hello there")
print(output.tokens) # ["[CLS]", "Hello", "there", "[SEP]"]
# Sequence pair with type IDs
output = tokenizer.encode("Hello there", "How are you?")
print(output.type_ids) # [0, 0, 0, 0, 1, 1, 1, 1]
Batch Encoding with Padding
# Enable automatic padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
outputs = tokenizer.encode_batch([
"Short sentence",
"This is a much longer sentence that needs padding"
])
print(outputs[0].attention_mask) # [1, 1, 1, 1, 0, ...] (padded positions = 0)
Load Pretrained from Hub
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
Advanced Topics
Tokenization Pipeline: Normalization, pre-tokenization, model, and post-processing in detail with the BERT-from-scratch example โ Tokenization Pipeline
Models Reference: BPE, WordPiece, Unigram, and WordLevel โ parameters, behavior, and training algorithms โ Models Reference
Components Reference: Normalizers, pre-tokenizers, post-processors, decoders, trainers, and added tokens โ Components Reference
API Reference: Tokenizer class methods, Encoding object attributes, input types, padding/truncation, batch operations, and the visualizer tool โ API Reference