vila-dataset-config

Configure VILA dataset registry YAML files for LMFlow training. Use when the user mentions VILA datasets, mixtures, dataset YAML, registry, j63, or needs to set up data configs for a new cluster.

By voidrank. Updated 2/28/2026

VILA Dataset Registry

Location

All config files live under:

LMFlow/third_party/vila/llava/data/registry/
├── mixtures.yaml              # mixture definitions (dataset lists)
└── datasets/
    ├── oci-nrt-cs.yaml        # per-cluster dataset paths
    ├── dfw.yaml               # cw-dfw cluster
    ├── draco-oci-iad.yaml     # draco cluster
    ├── cs-oci-ord.yaml        # cs-oci-ord cluster
    ├── default.yaml
    └── ...

Two-Layer Config

Layer 1: mixtures.yaml — What datasets go together

Defines named mixtures as lists of dataset names. Example:

j63_recipe:
    - sharegpt4v_gpt4_100k
    - llava_instruct
    - sharegpt4v_sft
    - dvqa_train_200k
    - chartqa_train_18k
    - ai2d_train_12k
    - docvqa_train_10k
    - geoqa
    - synthdog_en

A dataset name may carry an @ suffix to subsample it, e.g. sharegpt4v_sft@0.3 uses 30% of that dataset.
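The @ suffix convention can be illustrated with a small parser. This is only a sketch of the naming rule described above, not the code VILA itself uses; the function name `parse_mixture_entry` is hypothetical, and treating a bare name as ratio 1.0 and rejecting non-positive ratios are assumptions.

```python
def parse_mixture_entry(entry: str) -> tuple[str, float]:
    """Split 'name@ratio' into (name, ratio).

    Assumption: a bare name (no '@') means the full dataset, i.e. ratio 1.0.
    """
    name, sep, ratio_text = entry.partition("@")
    if not sep:
        return name, 1.0
    ratio = float(ratio_text)
    if ratio <= 0:
        # Assumption: a non-positive ratio is treated as a config error.
        raise ValueError(f"invalid subsample ratio in {entry!r}")
    return name, ratio


print(parse_mixture_entry("sharegpt4v_sft@0.3"))
print(parse_mixture_entry("geoqa"))
```

A helper like this is handy when linting mixtures.yaml, because it separates the dataset name (which must exist in the cluster YAML) from the sampling ratio.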

Layer 2: datasets/<cluster>.yaml — Where data lives on each cluster

Maps each dataset name to its actual path and loader. Example:

sharegpt4v_gpt4_100k:
    _target_: llava.data.LLaVADataset
    data_path: /lustre/.../sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
    media_dir: /lustre/.../data
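Because a mixture only names datasets while the cluster YAML defines them, a quick cross-check between the two layers can catch missing entries before a job is launched. The sketch below assumes both files have already been loaded into plain Python objects (a list of names and a dict of entries); `missing_datasets` is a hypothetical helper, not part of VILA or LMFlow.

```python
def missing_datasets(mixture: list[str], cluster_datasets: dict[str, dict]) -> list[str]:
    """Return mixture entries that have no definition in the cluster config."""
    missing = []
    for entry in mixture:
        # Drop an optional '@ratio' suffix before looking the name up.
        name = entry.split("@", 1)[0]
        if name not in cluster_datasets:
            missing.append(name)
    return missing


# Toy data mirroring the two layers; the paths stay elided as in the examples above.
mixture = ["sharegpt4v_gpt4_100k", "geoqa@0.5"]
cluster_datasets = {
    "sharegpt4v_gpt4_100k": {
        "_target_": "llava.data.LLaVADataset",
        "data_path": "/lustre/.../sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
        "media_dir": "/lustre/.../data",
    },
}

print(missing_datasets(mixture, cluster_datasets))
```

Here geoqa is reported as missing, so either the cluster YAML needs an entry for it or the mixture should not be used on that cluster.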

Dataset Types (_target_)

| Type | Usage | Required Fields |
| --- | --- | --- |
| `llava.data.LLaVADataset` | JSONL/JSON with optional images | `data_path`, `media_dir` (if has images) |
| `llava.data.ListOfJsonDataset` | Directory of JSON files (e.g. shizhe data) | `data_path` (dir path) |
| `llava.data.HFArrowDataset` | HuggingFace arrow format | `data_path`, `split` |
| `llava.data.HFParquetDataset` | HuggingFace parquet format | `data_path`, `split` |
| `llava.data.HFGeneralDataset` | General HF dataset | `data_path`, `split` |
| `llava.data.HFVinciCoderDataset` | VinciCoder specific | `data_path`, `split` |
| `llava.data.DummyDataset` | Testing | `num_instances` |
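The required fields per type lend themselves to a simple lint pass over a cluster config. The following is a minimal sketch that transcribes the table above into a lookup; `REQUIRED_FIELDS` and `check_entry` are hypothetical names, and `media_dir` is left out of the check because it is only needed when the data has images.

```python
# Required fields per _target_, transcribed from the table above.
REQUIRED_FIELDS = {
    "llava.data.LLaVADataset": ["data_path"],  # media_dir only if the data has images
    "llava.data.ListOfJsonDataset": ["data_path"],
    "llava.data.HFArrowDataset": ["data_path", "split"],
    "llava.data.HFParquetDataset": ["data_path", "split"],
    "llava.data.HFGeneralDataset": ["data_path", "split"],
    "llava.data.HFVinciCoderDataset": ["data_path", "split"],
    "llava.data.DummyDataset": ["num_instances"],
}


def check_entry(name: str, entry: dict) -> list[str]:
    """Return human-readable problems for one dataset entry (empty list if none)."""
    target = entry.get("_target_")
    if target not in REQUIRED_FIELDS:
        return [f"{name}: unknown or missing _target_ {target!r}"]
    return [
        f"{name}: missing required field {field!r}"
        for field in REQUIRED_FIELDS[target]
        if field not in entry
    ]


print(check_entry("demo_arrow", {"_target_": "llava.data.HFArrowDataset", "data_path": "/x"}))
```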

Optional Fields

| Field | Description |
| --- | --- |
| `resample_on_failure` | bool, retry on load failure (default true) |
| `max_length` | int, max sequence length |
| `split` | str, HF dataset split (train, test, subset) |

Key Mixtures

| Name | Base + Extra |
| --- | --- |
| `j63_recipe` | 9 SFT datasets |
| `j63_math` | j63_recipe + math, real-cqa, tabmwp, metamathqa, mminstruct |
| `j63_mmfinereasoning` | j63_recipe + MMFineReason/SFT-586K |
| `j63_math_mmfinereasoning` | j63_math + MMFineReason/SFT-586K |
| `j63_small_fastdllmv2` | sharegpt4v + llava + sharegpt4v_sft + fast_dllmv2 |
| `mmfinereason_fastdllmv2` | fast_dllmv2 + MMFineReason/SFT-586K-fullset |

Creating a New Cluster YAML

  1. Copy an existing cluster YAML (e.g. oci-nrt-cs.yaml) as a template.
  2. For each dataset entry, update data_path and media_dir to the new cluster's paths.
  3. Keep _target_ and the optional fields unchanged.
  4. Include only datasets that exist on the target cluster.
  5. Name the file <cluster-name>.yaml.
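The steps above can be sketched as a small transformation over an already-loaded template config. This is an illustration under stated assumptions, not a tool shipped with VILA: `remap_cluster_config` is a hypothetical helper, the `/old/root` and `/new/root` prefixes are placeholders, and loading or writing the YAML itself is left to whichever YAML library is in use.

```python
def remap_cluster_config(template: dict, prefix_map: dict, available: set) -> dict:
    """Build a new cluster config from a template.

    - keeps only datasets listed in `available` (step 4)
    - rewrites data_path and media_dir prefixes (step 2)
    - leaves _target_ and all other fields untouched (step 3)
    """
    new_config = {}
    for name, entry in template.items():
        if name not in available:
            continue
        new_entry = dict(entry)  # shallow copy so the template is not modified
        for key in ("data_path", "media_dir"):
            value = new_entry.get(key)
            if not isinstance(value, str):
                continue
            for old, new in prefix_map.items():
                if value.startswith(old):
                    new_entry[key] = new + value[len(old):]
                    break
        new_config[name] = new_entry
    return new_config


# Placeholder paths, purely for illustration.
template = {
    "llava_instruct": {
        "_target_": "llava.data.LLaVADataset",
        "data_path": "/old/root/llava_instruct.jsonl",
        "media_dir": "/old/root/images",
    },
    "geoqa": {
        "_target_": "llava.data.LLaVADataset",
        "data_path": "/old/root/geoqa.jsonl",
    },
}

result = remap_cluster_config(template, {"/old/root": "/new/root"}, {"llava_instruct"})
print(result)
```

The result would then be written out as <cluster-name>.yaml (step 5). Paths that match no prefix are left as they are, so it is worth reviewing the output for entries that still point at the old cluster.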

Data Path Mapping Across Clusters

| Dataset | oci-nrt root | cw-dfw root |
| --- | --- | --- |
| j63 base (ShareGPT4V, internvl playground) | `/lustre/.../shiyil/data/playground/` | `/lustre/.../chengyuew/data/vila-sft/internvl_chat/playground/` and `.../vila-sft/ShareGPT4V/` |
| shizhe/fast_dllmv2 | `/lustre/.../shiyil/data/playground/shizhe_stage1_no_reasoning` | `/lustre/.../chengyuew/data/shizhe_stage1_no_reasoning` |
| real-cqa, tabmwp, metamathqa, MMInstruct, MMFineReason | `/lustre/.../shiyil/data/<name>` | `/lustre/.../chengyuew/data/<name>` |

Install via CLI
npx skills add https://github.com/voidrank/slurm_skill --skill vila-dataset-config