name: vila-dataset-config
description: Configure VILA dataset registry YAML files for LMFlow training. Use when the user mentions VILA datasets, mixtures, dataset YAML, registry, j63, or needs to set up data configs for a new cluster.
VILA Dataset Registry
Location
All config files live under:
LMFlow/third_party/vila/llava/data/registry/
├── mixtures.yaml # mixture definitions (dataset lists)
└── datasets/
    ├── oci-nrt-cs.yaml # per-cluster dataset paths
    ├── dfw.yaml # cw-dfw cluster
    ├── draco-oci-iad.yaml # draco cluster
    ├── cs-oci-ord.yaml # cs-oci-ord cluster
    ├── default.yaml
    └── ...
Two-Layer Config
Layer 1: mixtures.yaml — What datasets go together
Defines named mixtures as lists of dataset names. Example:
j63_recipe:
  - sharegpt4v_gpt4_100k
  - llava_instruct
  - sharegpt4v_sft
  - dvqa_train_200k
  - chartqa_train_18k
  - ai2d_train_12k
  - docvqa_train_10k
  - geoqa
  - synthdog_en
Append an @ suffix to a dataset name to subsample it (e.g. sharegpt4v_sft@0.3 uses 30% of that dataset).
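As a minimal sketch of the @ suffix inside a mixture, the fragment below defines a smaller variant of the recipe. The mixture name `j63_recipe_small` is hypothetical and does not come from the registry; the dataset names and the 0.3 ratio are taken from the examples above.

```yaml
# mixtures.yaml (fragment)
# j63_recipe_small is a hypothetical mixture name used only for illustration.
j63_recipe_small:
  - sharegpt4v_gpt4_100k
  - llava_instruct
  - sharegpt4v_sft@0.3   # subsample: 30% of sharegpt4v_sft
```

Entries without a suffix are listed as plain dataset names.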
Layer 2: datasets/<cluster>.yaml — Where data lives on each cluster
Maps each dataset name to its actual path and loader. Example:
sharegpt4v_gpt4_100k:
  _target_: llava.data.LLaVADataset
  data_path: /lustre/.../sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
  media_dir: /lustre/.../data
Dataset Types (_target_)
| Type | Usage | Required Fields |
|---|---|---|
| `llava.data.LLaVADataset` | JSONL/JSON with optional images | `data_path`, `media_dir` (if has images) |
| `llava.data.ListOfJsonDataset` | Directory of JSON files (e.g. shizhe data) | `data_path` (dir path) |
| `llava.data.HFArrowDataset` | HuggingFace arrow format | `data_path`, `split` |
| `llava.data.HFParquetDataset` | HuggingFace parquet format | `data_path`, `split` |
| `llava.data.HFGeneralDataset` | General HF dataset | `data_path`, `split` |
| `llava.data.HFVinciCoderDataset` | VinciCoder specific | `data_path`, `split` |
| `llava.data.DummyDataset` | Testing | `num_instances` |
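To make the required fields concrete, here is a hedged sketch of two entries that use types other than `LLaVADataset`. The entry names, paths, and values are placeholders invented for illustration; only the `_target_` values and field names come from the table above.

```yaml
# datasets/<cluster>.yaml (fragment)
# Entry names, paths, and values below are placeholders, not real registry entries.
example_arrow_set:
  _target_: llava.data.HFArrowDataset
  data_path: /path/to/arrow_dataset   # placeholder path
  split: train

example_dummy:
  _target_: llava.data.DummyDataset
  num_instances: 1000                 # placeholder value
```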
Optional Fields
| Field | Description |
|---|---|
| `resample_on_failure` | bool, retry on load failure (default true) |
| `max_length` | int, max sequence length |
| `split` | str, HF dataset split (train, test, subset) |
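The optional fields sit alongside the required ones in the same entry. The sketch below assumes a text-only JSONL dataset, so `media_dir` is omitted; the entry name, path, and `max_length` value are placeholders.

```yaml
# datasets/<cluster>.yaml (fragment)
# example_jsonl_set and its values are placeholders for illustration.
example_jsonl_set:
  _target_: llava.data.LLaVADataset
  data_path: /path/to/data.jsonl   # placeholder path
  resample_on_failure: false       # overrides the default of true
  max_length: 4096                 # placeholder value
```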
Key Mixtures
| Name | Base | Extra |
|---|---|---|
| `j63_recipe` | 9 SFT datasets | none |
| `j63_math` | `j63_recipe` | + math, real-cqa, tabmwp, metamathqa, mminstruct |
| `j63_mmfinereasoning` | `j63_recipe` | + MMFineReason/SFT-586K |
| `j63_math_mmfinereasoning` | `j63_math` | + MMFineReason/SFT-586K |
| `j63_small_fastdllmv2` | sharegpt4v + llava + sharegpt4v_sft | + fast_dllmv2 |
| `mmfinereason_fastdllmv2` | fast_dllmv2 | + MMFineReason/SFT-586K-fullset |
Creating a New Cluster YAML
- Copy an existing cluster YAML (e.g. `oci-nrt-cs.yaml`) as a template
- For each dataset entry, update `data_path` and `media_dir` to the new cluster's paths
- Keep `_target_` and optional fields the same
- Only include datasets that exist on the target cluster
- File name convention: `<cluster-name>.yaml`
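The steps above can be sketched for a single entry as follows. The file name `my-cluster.yaml` and the `/path/on/my-cluster` root are hypothetical; substitute the real cluster name and storage root. The dataset name, `_target_`, and JSONL file name are copied from the Layer 2 example.

```yaml
# datasets/my-cluster.yaml (hypothetical cluster name)
sharegpt4v_gpt4_100k:
  _target_: llava.data.LLaVADataset   # unchanged from the template
  # Only the two paths change; /path/on/my-cluster is a placeholder root.
  data_path: /path/on/my-cluster/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
  media_dir: /path/on/my-cluster/data
```

Entries for datasets that are not present on the new cluster should be removed rather than left pointing at paths from the template.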
Data Path Mapping Across Clusters
| Dataset | oci-nrt root | cw-dfw root |
|---|---|---|
| j63 base (ShareGPT4V, internvl playground) | /lustre/.../shiyil/data/playground/ | /lustre/.../chengyuew/data/vila-sft/internvl_chat/playground/ and .../vila-sft/ShareGPT4V/ |
| shizhe/fast_dllmv2 | /lustre/.../shiyil/data/playground/shizhe_stage1_no_reasoning | /lustre/.../chengyuew/data/shizhe_stage1_no_reasoning |
| real-cqa, tabmwp, metamathqa, MMInstruct, MMFineReason | /lustre/.../shiyil/data/<name> | /lustre/.../chengyuew/data/<name> |