name: huggingface-git-xet-dataset-publisher
description: >-
  Manage Hugging Face dataset Git repositories with Git Xet/LFS-compatible tracking, including setup, safe snapshot commits, push verification, binary rejection recovery, and README track-rule maintenance.
Hugging Face Git Xet Dataset Publisher
Use this skill when the task is about operating a Hugging Face dataset Git repository (for example LanceDB data snapshots), including:
- initial setup,
- bootstrap from a non-Git local folder,
- daily commits/pushes,
- binary-rejected push recovery,
- track rule maintenance in .gitattributes and README.md.
When To Use
- Bind a local data directory to git@hf.co:datasets/<org_or_user>/<repo>.
- Configure Git Xet for large/binary dataset artifacts.
- Commit and push snapshot updates safely.
- Fix pre-receive hook declined binary rejection errors.
- Standardize tracking rules across docs and repo config.
Required Inputs
- repo_dir: local dataset repo path (for example /mnt/wsl/data4tb/static-flow-data/lancedb).
- hf_remote: dataset SSH remote (git@hf.co:datasets/...).
- branch: target branch (default main).
- File patterns that must be Xet-tracked (default below).
Recommended default track patterns for LanceDB-like repos:
- *.lance
- *.txn
- *.manifest
- *.idx
- Optional extras if present in repo:
  - *.arrow
Hard Rules
- Always run from repo_dir and print current branch/remote before mutation.
- Never force-push unless the user explicitly requests it.
- Before each push, verify offending binary paths are tracked (filter=lfs).
- If a new binary extension appears, update both:
  - .gitattributes (track rule),
  - README.md setup instructions.
- Treat filter=lfs in .gitattributes as expected for Git Xet on HF.
- If source data is actively being written (for example live LanceDB), pause writers before snapshot commit/push to avoid inconsistent snapshots.
Non-Git Folder Bootstrap (Given HF Remote)
Use this when repo_dir is just a plain folder (no .git) and the user provides hf_remote.
Step 0: Enter source directory
cd <repo_dir>
Step 1: Detect remote state
- Check if remote already has main:
git ls-remote --heads <hf_remote> main
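The result of this check selects between Step 2A and Step 2B. A minimal sketch of that decision follows, assuming the usual git ls-remote output of one "<sha><TAB>refs/heads/<name>" line per matching branch. The function name choose_bootstrap_step is hypothetical and not part of any tool.

```shell
# Hypothetical helper: reads `git ls-remote --heads <hf_remote> main` output
# on stdin and prints which bootstrap step applies.
choose_bootstrap_step() {
  if grep -q 'refs/heads/main$'; then
    echo "2A"   # remote already has main: align local main to it
  else
    echo "2B"   # no main on the remote: first publish
  fi
}

# Usage (run from repo_dir):
#   git ls-remote --heads <hf_remote> main | choose_bootstrap_step
```

Empty output is treated as "no main", so it is worth checking the exit status of git ls-remote separately: an authentication or network failure also produces no output.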
Step 2A: Remote already has history (recommended in-place bootstrap)
- Initialize and bind:
git init
git remote add origin <hf_remote>
- Align local main to remote:
git fetch origin main
git checkout -B main origin/main
Step 2B: Remote is empty (first publish)
- Initialize and bind:
git init -b main
git remote add origin <hf_remote>
Step 3: Enable Xet and add track rules
git xet install
git xet track "*.lance" "*.txn" "*.manifest" "*.idx"
- Optional if present:
git xet track "*.arrow"
Step 4: Re-index with tracking and commit
- Stage all:
git add -A
- If files were previously staged/committed without tracking, re-index once:
git rm -r --cached .
git add -A
- Commit:
git commit -m "data: initial sync with xet tracking" || echo "no changes"
Step 5: Push
git push origin main
- Verify:
git rev-list --left-right --count origin/main...main
- expected after success:
0 0
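The two numbers count commits found only on origin/main (left) and only on local main (right). A small sketch of how that output could be interpreted follows; the helper name describe_divergence and its messages are illustrative assumptions.

```shell
# Hypothetical helper: reads the "<left><TAB><right>" line printed by
#   git rev-list --left-right --count origin/main...main
# left  = commits only on origin/main (local is behind by this many)
# right = commits only on local main  (local is ahead by this many)
describe_divergence() {
  read -r behind ahead || return 1
  if [ "$behind" -eq 0 ] && [ "$ahead" -eq 0 ]; then
    echo "in sync"
  elif [ "$behind" -eq 0 ]; then
    echo "local ahead by $ahead (push pending)"
  elif [ "$ahead" -eq 0 ]; then
    echo "local behind by $behind"
  else
    echo "diverged: behind $behind, ahead $ahead"
  fi
}

# Usage (run from repo_dir):
#   git rev-list --left-right --count origin/main...main | describe_divergence
```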
Preflight Checklist
- Verify tooling:
git --version
git lfs version
git xet --version
- Verify remote:
git remote -v
- Verify branch/divergence:
git branch -vv
git fetch origin
git rev-list --left-right --count origin/main...main
- Verify current tracking:
git check-attr filter -- <sample_file>
git lfs ls-files | head
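The tooling part of this checklist can fail fast before any repository mutation. The sketch below assumes that the lfs and xet subcommands are provided by executables named git-lfs and git-xet on PATH, which depends on how they were installed. The helper name missing_tools is hypothetical.

```shell
# Hypothetical helper: prints each named executable that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: stop early if anything is missing.
#   missing=$(missing_tools git git-lfs git-xet)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```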
Setup Workflow (First-Time)
- Initialize or bind repo:
git init -b main (if needed)
git remote add origin <hf_remote>
git fetch origin main
git switch -C main origin/main
- if the folder is not a Git repo yet, prefer the full bootstrap section above (Non-Git Folder Bootstrap (Given HF Remote)).
- Install/enable Xet (if not already):
git xet install
- Add track rules:
git xet track "*.lance"
git xet track "*.txn"
git xet track "*.manifest"
git xet track "*.idx"
- Validate:
rg -n "\\*\\.(lance|txn|manifest|idx)" .gitattributes
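Where rg is not installed, or when a list of missing patterns is more useful than matching lines, the same validation can be sketched with awk. This assumes each rule in .gitattributes is written as the pattern followed by attributes that include filter=lfs, as described in Hard Rules. The function name missing_track_patterns is hypothetical.

```shell
# Hypothetical helper: prints every required pattern that has no
# "<pattern> ... filter=lfs ..." line in the given attributes file.
missing_track_patterns() {
  attrs_file=$1
  shift
  for pattern in "$@"; do
    # A rule counts only if its first field is exactly the pattern
    # and the line sets filter=lfs.
    if ! awk -v p="$pattern" \
        '$1 == p && /filter=lfs/ { found = 1 } END { exit !found }' \
        "$attrs_file" 2>/dev/null; then
      printf '%s\n' "$pattern"
    fi
  done
}

# Usage (run from repo_dir); no output means all rules are present:
#   missing_track_patterns .gitattributes '*.lance' '*.txn' '*.manifest' '*.idx'
```

A missing or unreadable .gitattributes reports every pattern as missing, which is the conservative outcome before a push.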
Daily Snapshot Workflow
- Confirm clean branch target:
git switch main
git fetch origin
- Add/commit:
git add -A
git commit -m "data: sync <timestamp>" || echo "no changes"
- Push:
git push origin main
- Post-push sanity:
- ensure no remote rejection,
  - ensure expected commit appears in git log --oneline -n 3.
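The daily steps can be wrapped in one function. This is a sketch under stated assumptions: the <timestamp> placeholder is filled with a UTC ISO 8601 time, and an explicit staged-changes check (git diff --cached --quiet) replaces the || echo "no changes" fallback so that a real commit failure is not masked. The names snapshot_message and daily_snapshot are hypothetical.

```shell
# Hypothetical helper: builds the commit message with a UTC timestamp.
snapshot_message() {
  printf 'data: sync %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Hypothetical wrapper for the daily workflow; changes into repo_dir.
daily_snapshot() {
  repo_dir=$1
  cd "$repo_dir" || return 1
  git switch main || return 1
  git fetch origin || return 1
  git add -A || return 1
  # Exit status 0 from `git diff --cached --quiet` means nothing is staged.
  if git diff --cached --quiet; then
    echo "no changes"
    return 0
  fi
  git commit -m "$(snapshot_message)" || return 1
  git push origin main || return 1
  git log --oneline -n 3
}

# Usage:
#   daily_snapshot <repo_dir>
```

The sketch does not pause writers or verify tracking before the push; those Hard Rules still apply around it.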
Binary Rejection Recovery Workflow
Error signature:
Your push was rejected because it contains binary files
Offending files: .../*.idx (or another extension)
A) Diagnose precisely
- Check whether offending files are tracked:
git check-attr filter -- <offending_path>
- If filter: unspecified, add missing rule:
git xet track "*.idx" (or relevant extension)
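When many paths are rejected at once, the diagnosis can be batched. The sketch below assumes the "<path>: filter: <value>" output format of git check-attr and paths that do not themselves contain ": ". The helper name untracked_patterns is hypothetical.

```shell
# Hypothetical helper: reads `git check-attr filter -- <paths>` output on
# stdin and prints one "*.ext" pattern per extension whose filter is not lfs.
untracked_patterns() {
  awk -F': ' '$NF != "lfs" {
      n = split($1, parts, ".")
      # Skip paths without a file extension (no dot, or the last dot sits
      # in a directory name).
      if (n > 1 && parts[n] !~ /\//) print "*." parts[n]
  }' | sort -u
}

# Usage (run from repo_dir):
#   git check-attr filter -- <offending_path> ... | untracked_patterns
```

Each printed pattern is a candidate argument for git xet track, after which .gitattributes and README.md both need the update described in Hard Rules.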
B) Recover from contaminated local history safely
Use this when local main has bad commits and push keeps failing.
- Preserve local pointer:
git switch -c backup/pre-xet-fix-<timestamp>
- Align main to clean remote tip:
git switch -C main origin/main
- Re-apply current local data changes (if any), then:
git add -A
git commit -m "data: sync and fix xet tracking"
- Push:
git push origin main
C) Alternative clean-branch publication
Use when branch surgery on local main is undesirable.
- Create clean branch from remote tip:
git switch -c clean-main origin/main
- Bring desired files into clean branch.
- Ensure track rules include offending extension(s).
- Commit and publish:
git push origin clean-main:main
- Realign local main:
git switch -C main origin/main
README Maintenance Contract
Whenever adding a new tracked binary extension:
- Add it to .gitattributes via git xet track.
- Update README setup snippet so future runs include it.
- Update storage-format description if needed (for example mention *.idx sidecars).
Verification Commands (Must Run Before Final Report)
- git rev-list --left-right --count origin/main...main
  - expected after success: 0 0 (or 0 N before final push).
- git check-attr filter -- <offending_paths>
  - expected: filter: lfs
- git lfs ls-files | rg "\\.idx$" (or relevant extension)
  - expected: tracked pointers listed.
- Optional pointer inspection: git show :<path> | sed -n '1,3p'
  - expected: version https://git-lfs.github.com/spec/v1
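The optional pointer inspection can be turned into a pass/fail check. This sketch assumes the staged blob is a pointer exactly when its first line is the spec line quoted above. The function name is_lfs_pointer is hypothetical, and head -c is assumed to be available (it is in GNU and BSD userlands).

```shell
# Hypothetical helper: succeeds if stdin starts with the Git LFS pointer
# spec line. That line is 42 bytes long, so only 42 bytes are read, which
# avoids scanning a large binary that was staged without tracking.
is_lfs_pointer() {
  head -c 42 | grep -qx 'version https://git-lfs\.github\.com/spec/v1'
}

# Usage (run from repo_dir):
#   git show :<path> | is_lfs_pointer && echo "pointer" || echo "raw content"
```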
Reporting Template
Always report:
- remote/branch operated,
- exact tracking patterns currently active,
- whether README was updated,
- divergence before/after (origin/main...main),
- final push result and commit hash.