name: nanecho-custom-vocab
description: "Build custom BPE vocabularies and tokenizers for the NanEcho model (drzo/echoself) from domain-specific JSONL training data. Covers data analysis, tokenizer training, nanoGPT binary preparation, 9cog/echoself repo integration, and triggering the netrain-cached GitHub Actions training pipeline. Use when adapting NanEcho to a new domain with specialized terminology."
NanEcho Custom Vocabulary Training
Triggers
Use this skill when the user wants to:
- Train the NanEcho model with a custom vocabulary.
- Build a custom tokenizer for NanEcho from their own data.
- Fine-tune NanEcho on a new domain with specialized terminology.
- Create a more efficient version of NanEcho for a specific task.
Example prompts:
- "I want to train NanEcho on my own dataset with a custom vocabulary."
- "How can I build a custom tokenizer for the echoself model?"
- "Fine-tune nanecho with a new vocabulary based on these documents."
Workflow
The end-to-end process is broken down into the following phases:
Phase 1: Data Analysis & Preparation
- Receive Training Data: The user must provide one or more training data files in JSONL format. These files should contain the text corpus for the new domain.
- Analyze Data: Before proceeding, analyze the provided JSONL files to understand their structure, content, and size. Use a script to count documents, analyze text length distribution, and identify the keys containing the text content.
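The analysis step can be sketched with the standard library alone. This is a minimal illustration, not one of the skill's scripts: the function name `analyze_jsonl` and the heuristic for guessing the text field (the string-valued key that carries the most characters) are assumptions made for this example.

```python
import json
import statistics
from collections import Counter


def analyze_jsonl(lines):
    """Summarize JSONL lines: document count, key usage, and text lengths."""
    key_counts = Counter()   # how often each top-level key appears
    text_lengths = {}        # key -> character lengths of its string values
    n_docs = 0
    n_bad = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            n_bad += 1
            continue
        if not isinstance(record, dict):
            n_bad += 1
            continue
        n_docs += 1
        for key, value in record.items():
            key_counts[key] += 1
            if isinstance(value, str):
                text_lengths.setdefault(key, []).append(len(value))
    # Heuristic (assumption): the text field is the string key with the most characters.
    text_key = max(text_lengths, key=lambda k: sum(text_lengths[k]), default=None)
    lengths = text_lengths.get(text_key, [])
    return {
        "documents": n_docs,
        "malformed_lines": n_bad,
        "keys": dict(key_counts),
        "text_key": text_key,
        "mean_chars": statistics.mean(lengths) if lengths else 0,
        "max_chars": max(lengths, default=0),
    }


if __name__ == "__main__":
    sample = [
        '{"id": "a1", "text": "Deep Tree Echo is a cognitive architecture."}',
        '{"id": "a2", "text": "Short note."}',
        "not json",
    ]
    print(analyze_jsonl(sample))
```

For a real file, pass an open file handle (for example `analyze_jsonl(open(path, encoding="utf-8"))`) so the corpus is streamed line by line rather than loaded into memory.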
Phase 2: Custom Tokenizer Training
Train BPE Tokenizer: Use the scripts/build_tokenizer.py script to train a custom Byte-Pair Encoding (BPE) tokenizer from the training corpus.
- This script takes the raw text from the JSONL files and trains a tokenizers model.
- It defines special tokens crucial for the Deep Tree Echo cognitive architecture (e.g., <|echo|>, <|agent|>, <|b9|>).
- The output is a directory (e.g., dte_tokenizer/) containing tokenizer.json and other configuration files.

    python3 scripts/build_tokenizer.py \
        --data-files /path/to/your/data.jsonl \
        --output-dir dte_tokenizer \
        --vocab-size 8192
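To make the training step concrete, here is a toy, pure-Python sketch of what BPE training does: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is not the implementation in scripts/build_tokenizer.py, which is not shown in this document. The function name `train_toy_bpe`, the whitespace pre-splitting, and the choice to give special tokens the lowest IDs are all assumptions for illustration.

```python
from collections import Counter

# Special tokens named in this skill; reserved first so they are never split or merged.
SPECIAL_TOKENS = ["<|startoftext|>", "<|endoftext|>", "<|echo|>", "<|agent|>", "<|b9|>"]


def train_toy_bpe(texts, num_merges):
    """Learn up to `num_merges` BPE merges over whitespace-split words (toy version)."""
    # Each distinct word becomes a tuple of single-character symbols, weighted by frequency.
    words = Counter(tuple(word) for text in texts for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by pair ordering for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    # Vocabulary: special tokens first, then single characters, then merged symbols.
    chars = sorted({ch for text in texts for ch in text if not ch.isspace()})
    vocab = {}
    for token in SPECIAL_TOKENS + chars + [a + b for a, b in merges]:
        vocab.setdefault(token, len(vocab))
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = train_toy_bpe(["echo echo echo", "echoes"], num_merges=3)
    print(merges)
    print(len(vocab))
```

In this toy version the target vocabulary size is controlled indirectly: the final vocabulary holds the special tokens, the base characters, and one entry per merge, so a `--vocab-size` of 8192 corresponds to running merges until that total is reached.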
Phase 3: Data Tokenization for nanoGPT
Prepare Binary Data: Use the scripts/prepare_data.py script to convert the raw JSONL data into tokenized train.bin and val.bin files using the newly trained tokenizer.
- This script reads the JSONL files, formats the text with special tokens (<|startoftext|>, <|endoftext|>), and uses the custom tokenizer to encode the text into integer token IDs.
- The output is a pair of binary files containing uint16 numpy arrays, which is the format expected by the nanoGPT training pipeline in the 9cog/echoself repository.

    python3 scripts/prepare_data.py \
        --tokenizer-dir dte_tokenizer \
        --data-files /path/to/your/data.jsonl \
        --output-dir data/nanecho_dte
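The binary format can be sketched as follows. This is a dependency-free illustration, not the code in scripts/prepare_data.py: it uses the standard library `array` module in place of numpy, which should produce the same raw native-endian 16-bit layout on common platforms. The function name `write_bins`, the `encode` callable, and the 10% validation split are assumptions for this example.

```python
from array import array
from pathlib import Path

BOS, EOS = "<|startoftext|>", "<|endoftext|>"


def write_bins(documents, encode, output_dir, val_fraction=0.1):
    """Wrap each document in BOS/EOS, encode it, and write train.bin / val.bin."""
    ids = array("H")  # unsigned 16-bit values, so every token ID must be below 65536
    assert ids.itemsize == 2, "array('H') is not 16-bit on this platform"
    for text in documents:
        # `encode` is a stand-in for the custom tokenizer: str -> iterable of int IDs.
        ids.extend(encode(f"{BOS}{text}{EOS}"))
    # Split by token position (assumption): the last `val_fraction` becomes validation data.
    split = int(len(ids) * (1.0 - val_fraction))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, chunk in (("train.bin", ids[:split]), ("val.bin", ids[split:])):
        with open(out / name, "wb") as fh:
            chunk.tofile(fh)
    return split, len(ids) - split


if __name__ == "__main__":
    # Hypothetical byte-level encoder, used only so the sketch runs without a tokenizer.
    byte_encode = lambda s: list(s.encode("utf-8"))
    print(write_bins(["first document", "second document"], byte_encode, "data/demo_bins"))
```

The 16-bit width is why the vocabulary size matters here: a vocabulary of 8192 fits comfortably, while any ID of 65536 or above would raise an `OverflowError` instead of being silently truncated.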
Phase 4: Repository Integration
- Clone Repository: Clone the 9cog/echoself repository.
- Copy Artifacts: Copy the generated tokenizer directory (dte_tokenizer/) and the prepared data directory (data/nanecho_dte/) into the cloned repository. Place them in NanEcho/dte_tokenizer/ and data/nanecho_dte/ respectively.
- Add Integration Scripts: Copy the integration scripts from this skill (scripts/) into the NanEcho/ directory of the repository.
- Commit and Push: Commit all the new files (tokenizer, data, scripts) to a new feature branch and push it to the 9cog/echoself repository.
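The copy steps above can be sketched in Python as follows, assuming the repository has already been cloned. The function name `stage_artifacts` is hypothetical, and copying the contents of scripts/ directly into NanEcho/ (rather than into a NanEcho/scripts/ subdirectory) is one reading of the instructions. Cloning, committing, and pushing are left to git and are not covered here.

```python
import shutil
from pathlib import Path


def stage_artifacts(repo_root, tokenizer_dir, data_dir, scripts_dir):
    """Copy tokenizer, data, and scripts into the repository layout from Phase 4."""
    repo = Path(repo_root)
    targets = {
        Path(tokenizer_dir): repo / "NanEcho" / "dte_tokenizer",
        Path(data_dir): repo / "data" / "nanecho_dte",
        Path(scripts_dir): repo / "NanEcho",  # assumption: scripts land directly in NanEcho/
    }
    for source, destination in targets.items():
        # dirs_exist_ok (Python 3.8+) merges into directories the repo already contains.
        shutil.copytree(source, destination, dirs_exist_ok=True)
    return list(targets.values())


if __name__ == "__main__":
    # Example paths only; adjust to wherever the clone and the artifacts actually live.
    if Path("echoself").is_dir():
        print(stage_artifacts("echoself", "dte_tokenizer", "data/nanecho_dte", "scripts"))
```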
Phase 5: Trigger Training
Dispatch Workflow: Trigger the netrain-cached.yml GitHub Actions workflow.
- Crucially, you must override the default parameters to point to the new data directory and specify the new vocabulary size.
- This is done by modifying the train_cached.py command in the workflow or by passing inputs if the workflow is configured to accept them.
Example invocation (if using command-line overrides):

    python train_cached.py \
        --data_dir data/nanecho_dte \
        --vocab_size 8192 \
        --n_layer 4 --n_head 4 --n_embd 256  # Match model size to vocab size
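The comment about matching model size to vocabulary size can be checked with a rough parameter estimate. The sketch below assumes a GPT-2-style transformer in which the token embedding is tied to the output head, as in nanoGPT, and it ignores biases, LayerNorm weights, and position embeddings. The function name `estimate_params` is hypothetical, and the actual NanEcho model may differ.

```python
def estimate_params(vocab_size, n_layer, n_embd):
    """Rough GPT-2-style parameter count (no biases, LayerNorm, or position embeddings)."""
    # Token embedding, assumed to be shared with the output projection.
    embedding = vocab_size * n_embd
    # Per block: attention uses about 4 * d^2 weights, the 4x-wide MLP about 8 * d^2.
    per_block = 12 * n_embd * n_embd
    blocks = n_layer * per_block
    total = embedding + blocks
    return {
        "embedding": embedding,
        "blocks": blocks,
        "total": total,
        "embedding_share": embedding / total,
    }


if __name__ == "__main__":
    # Settings from the example invocation above.
    print(estimate_params(vocab_size=8192, n_layer=4, n_embd=256))
    # Same model with the GPT-2 vocabulary size, for comparison.
    print(estimate_params(vocab_size=50257, n_layer=4, n_embd=256))
```

Under these assumptions, the example settings give about 2.1M embedding parameters out of roughly 5.2M in total, so the embedding accounts for 40% of the model. With the 50,257-entry GPT-2 vocabulary, the embedding would dominate a model this small, which is the motivation for pairing a compact vocabulary with a compact architecture.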
Phase 6: Model Conversion & Deployment
- Download Checkpoint: Once training is complete, download the best checkpoint (ckpt.pt) from the workflow artifacts.
- Convert to HuggingFace: Use the scripts/convert_to_huggingface.py script to convert the nanoGPT-style checkpoint into a standard HuggingFace model format. This script bundles the custom tokenizer with the model weights.
- Upload to Hub: Upload the converted model directory to the HuggingFace Hub.
Resources
- Scripts: See the scripts/ directory for the Python scripts required for this workflow.
- References: See the references/ directory for documentation on the tokenizer configuration and special tokens.
- Templates: See the templates/ directory for a template of the dte_tokenizer_config.json file.