vila-dataset-config

name: vila-dataset-config
description: Configure VILA dataset registry YAML files for LMFlow training. Use when the user mentions VILA datasets, mixtures, dataset YAML, registry, j63, or needs to set up data configs for a new cluster.

VILA Dataset Registry

Location

All config files live under:

LMFlow/third_party/vila/llava/data/registry/
├── mixtures.yaml              # mixture definitions (dataset lists)
└── datasets/
    ├── oci-nrt-cs.yaml        # per-cluster dataset paths
    ├── dfw.yaml               # cw-dfw cluster
    ├── draco-oci-iad.yaml     # draco cluster
    ├── cs-oci-ord.yaml        # cs-oci-ord cluster
    ├── default.yaml
    └── ...

Two-Layer Config

Layer 1: `mixtures.yaml` — What datasets go together

Defines named mixtures as lists of dataset names. Example:

j63_recipe:
    - sharegpt4v_gpt4_100k
    - llava_instruct
    - sharegpt4v_sft
    - dvqa_train_200k
    - chartqa_train_18k
    - ai2d_train_12k
    - docvqa_train_10k
    - geoqa
    - synthdog_en

A dataset name can take an `@` suffix to subsample it (e.g. `sharegpt4v_sft@0.3` uses 30% of the data).
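The `@` suffix can be illustrated with a small parser. This is only a sketch of the naming convention, not the loader VILA actually uses: the function name `parse_dataset_spec` is hypothetical, and treating a bare name as ratio 1.0 and rejecting non-positive ratios are assumptions.

```python
from __future__ import annotations


def parse_dataset_spec(spec: str) -> tuple[str, float]:
    """Split 'name@ratio' into (name, ratio).

    Assumption: a name without '@' means "use all of the data" (ratio 1.0).
    """
    name, sep, ratio_text = spec.partition("@")
    if not sep:
        return name, 1.0
    ratio = float(ratio_text)  # raises ValueError on non-numeric text
    if ratio <= 0.0:
        # Assumption: a zero or negative ratio is a config mistake.
        raise ValueError(f"subsample ratio must be positive, got {ratio}")
    return name, ratio


if __name__ == "__main__":
    print(parse_dataset_spec("sharegpt4v_sft@0.3"))
    print(parse_dataset_spec("geoqa"))
```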

Layer 2: `datasets/<cluster>.yaml` — Where data lives on each cluster

Maps each dataset name to its actual path and loader. Example:

sharegpt4v_gpt4_100k:
    _target_: llava.data.LLaVADataset
    data_path: /lustre/.../sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
    media_dir: /lustre/.../data
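How the two layers fit together can be sketched as a lookup: each name in a mixture is resolved against the cluster's dataset map. The sketch below works on plain dicts standing in for the parsed YAML files; `resolve_mixture` is a hypothetical helper, not part of VILA, and the sample data is abbreviated from the examples above.

```python
from __future__ import annotations


def resolve_mixture(
    mixtures: dict[str, list[str]],
    datasets: dict[str, dict],
    mixture_name: str,
) -> tuple[list[tuple[str, dict]], list[str]]:
    """Return (resolved, missing) for one mixture on one cluster.

    resolved: (spec, dataset entry) pairs found in the cluster map.
    missing:  dataset names the cluster map does not define.
    """
    resolved, missing = [], []
    for spec in mixtures[mixture_name]:
        name = spec.split("@", 1)[0]  # drop an optional subsampling suffix
        if name in datasets:
            resolved.append((spec, datasets[name]))
        else:
            missing.append(name)
    return resolved, missing


# Stand-ins for the parsed mixtures.yaml and datasets/<cluster>.yaml.
mixtures = {"j63_recipe": ["sharegpt4v_gpt4_100k", "llava_instruct"]}
datasets = {
    "sharegpt4v_gpt4_100k": {
        "_target_": "llava.data.LLaVADataset",
        "data_path": "/lustre/.../sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
        "media_dir": "/lustre/.../data",
    }
}

resolved, missing = resolve_mixture(mixtures, datasets, "j63_recipe")
print("missing on this cluster:", missing)
```

Running a check like this before training makes it obvious when a mixture names a dataset that the current cluster YAML does not define.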

Dataset Types (`_target_`)

Type	Usage	Required Fields
`llava.data.LLaVADataset`	JSONL/JSON with optional images	`data_path`, `media_dir` (if the data has images)
`llava.data.ListOfJsonDataset`	Directory of JSON files (e.g. shizhe data)	`data_path` (dir path)
`llava.data.HFArrowDataset`	HuggingFace arrow format	`data_path`, `split`
`llava.data.HFParquetDataset`	HuggingFace parquet format	`data_path`, `split`
`llava.data.HFGeneralDataset`	General HF dataset	`data_path`, `split`
`llava.data.HFVinciCoderDataset`	VinciCoder specific	`data_path`, `split`
`llava.data.DummyDataset`	Testing	`num_instances`
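The required-fields column above can be turned into a quick sanity check for a dataset entry. This is a minimal sketch, assuming entries are already parsed into dicts; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and `media_dir` is left out of the check because it is only needed when the data has images.

```python
from __future__ import annotations

# Required fields per _target_, copied from the table above.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "llava.data.LLaVADataset": ("data_path",),  # media_dir only with images
    "llava.data.ListOfJsonDataset": ("data_path",),
    "llava.data.HFArrowDataset": ("data_path", "split"),
    "llava.data.HFParquetDataset": ("data_path", "split"),
    "llava.data.HFGeneralDataset": ("data_path", "split"),
    "llava.data.HFVinciCoderDataset": ("data_path", "split"),
    "llava.data.DummyDataset": ("num_instances",),
}


def missing_fields(entry: dict) -> list[str]:
    """List the required fields that a dataset entry does not set."""
    target = entry.get("_target_")
    if target not in REQUIRED_FIELDS:
        raise ValueError(f"unknown _target_: {target!r}")
    return [field for field in REQUIRED_FIELDS[target] if field not in entry]


if __name__ == "__main__":
    entry = {"_target_": "llava.data.HFArrowDataset", "data_path": "/some/dir"}
    print(missing_fields(entry))
```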

Optional Fields

Field	Description
`resample_on_failure`	bool, retry on load failure (default true)
`max_length`	int, max sequence length
`split`	str, HF dataset split (`train`, `test`, `subset`)

Key Mixtures

Name	Base	Extra
`j63_recipe`	9 SFT datasets	—
`j63_math`	j63_recipe	+ math, real-cqa, tabmwp, metamathqa, mminstruct
`j63_mmfinereasoning`	j63_recipe	+ MMFineReason/SFT-586K
`j63_math_mmfinereasoning`	j63_math	+ MMFineReason/SFT-586K
`j63_small_fastdllmv2`	sharegpt4v + llava + sharegpt4v_sft	+ fast_dllmv2
`mmfinereason_fastdllmv2`	fast_dllmv2	+ MMFineReason/SFT-586K-fullset

Creating a New Cluster YAML

1. Copy an existing cluster YAML (e.g. `oci-nrt-cs.yaml`) as a template.
2. For each dataset entry, update `data_path` and `media_dir` to the new cluster's paths.
3. Keep `_target_` and the optional fields unchanged.
4. Include only datasets that exist on the target cluster.
5. Name the file `<cluster-name>.yaml`.
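The path-update step can be sketched as a prefix swap over the parsed template. This is an illustration under stated assumptions, not an existing VILA tool: `remap_cluster_config` is a hypothetical helper, and `/old/root` and `/new/root` are placeholder paths, not real cluster locations.

```python
from __future__ import annotations

import copy

# Only these fields hold cluster-specific paths; everything else is kept.
PATH_FIELDS = ("data_path", "media_dir")


def remap_cluster_config(datasets: dict, old_root: str, new_root: str) -> dict:
    """Return a copy of `datasets` with old_root swapped for new_root.

    Paths that do not start with old_root are left untouched.
    """
    remapped = {}
    for name, entry in datasets.items():
        new_entry = copy.deepcopy(entry)  # never mutate the template
        for field in PATH_FIELDS:
            value = new_entry.get(field)
            if isinstance(value, str) and value.startswith(old_root):
                new_entry[field] = new_root + value[len(old_root):]
        remapped[name] = new_entry
    return remapped


# Placeholder template entry with hypothetical paths.
template = {
    "sharegpt4v_gpt4_100k": {
        "_target_": "llava.data.LLaVADataset",
        "data_path": "/old/root/sharegpt4v/train.jsonl",
        "media_dir": "/old/root/sharegpt4v/data",
    }
}

new_cluster = remap_cluster_config(template, "/old/root", "/new/root")
print(new_cluster["sharegpt4v_gpt4_100k"]["data_path"])
```

A single prefix swap only covers datasets that share one root. Where roots differ per dataset, the function can be called once per root pair. It does not check that the new paths exist, so entries for data that is absent on the target cluster still need to be removed by hand.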

Data Path Mapping Across Clusters

Dataset	oci-nrt root	cw-dfw root
j63 base (ShareGPT4V, internvl playground)	`/lustre/.../shiyil/data/playground/`	`/lustre/.../chengyuew/data/vila-sft/internvl_chat/playground/` and `.../vila-sft/ShareGPT4V/`
shizhe/fast_dllmv2	`/lustre/.../shiyil/data/playground/shizhe_stage1_no_reasoning`	`/lustre/.../chengyuew/data/shizhe_stage1_no_reasoning`
real-cqa, tabmwp, metamathqa, MMInstruct, MMFineReason	`/lustre/.../shiyil/data/<name>`	`/lustre/.../chengyuew/data/<name>`