---
name: nousnet-distributed-training
description: >-
  Integrate the Psyche/nousnet distributed training protocol into the Experience
  mesh for federated SLM training across untrusted LAN/internet peers.
  Apache 2.0, Rust-primary (77.8%). Key mechanisms: DisTrO optimizer (DCT-TopK
  sparse gradient compression), QUIC+gossip P2P via iroh, witness quorum
  integrity (bloom-filter proofs, 2/3 majority), Solana compute-point rewards.
  Use when planning federated training across the Experience mesh (e.g. 2x6GB
  GPU peers sharing training load for a Magicborn SLM). Foundation tier: this
  is a strategic platform capability, not a one-off script.
status: stable
lane: [slm, training, mesh, infra]
type: platform-integration
trigger: >-
  "nousnet", "psyche distributed training", "federated training", "DisTrO",
  "mesh training", "p2p training", "gradient compression", "experience training"
source-ingest: .planning/ingest/2026-06-07-article-github-com-psychefoundation-nousnet.md
port-score: strategic (Apache 2.0, active port candidate)
argument-hint: --mode centralized|decentralized --peers
---
nousnet-distributed-training
Federated transformer training across Experience mesh peers. Based on PsycheFoundation/nousnet (Apache 2.0, Rust-primary).
Adoption type: wrap + integrate — Apache 2.0 allows direct use. Do NOT fork unless adding mesh-specific extensions (e.g. Experience economy token integration). Consume the published crate or build from source.
Architecture (Psyche/nousnet)
```text
Operator Peer A (6GB GPU)              Operator Peer B (6GB GPU)
┌─────────────────────────┐            ┌─────────────────────────┐
│ Data shard (disjoint)   │            │ Data shard (disjoint)   │
│ Forward + backward      │            │ Forward + backward      │
│ DisTrO optimizer        │            │ DisTrO optimizer        │
│ → sparse DCT result     │  ← iroh →  │ → sparse DCT result     │
└─────────────────────────┘    QUIC    └─────────────────────────┘
             │                                      │
             └──────────── Coordinator ─────────────┘
                         (TCP or Solana)
                Witness quorum (2/3 bloom-filter)
```
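The "Data shard (disjoint)" boxes mean that no two peers process the same batch within a round. The following is a minimal sketch of one way to guarantee that. It assumes a hypothetical round-robin policy in which batch `b` goes to peer `b mod P`. Psyche's coordinator makes its own assignment decisions, and `assign_batches` and `is_disjoint_cover` are illustrative names, not part of its API.

```rust
use std::collections::HashSet;

/// Hypothetical round-robin assignment: batch `b` goes to peer `b % num_peers`.
/// Each batch lands on exactly one peer, so the shards are disjoint by construction.
fn assign_batches(num_batches: u64, num_peers: usize) -> Vec<Vec<u64>> {
    assert!(num_peers > 0, "need at least one peer");
    let mut shards: Vec<Vec<u64>> = vec![Vec::new(); num_peers];
    for batch_id in 0..num_batches {
        shards[(batch_id % num_peers as u64) as usize].push(batch_id);
    }
    shards
}

/// True if no batch id appears in two shards and every batch in 0..num_batches is covered.
fn is_disjoint_cover(shards: &[Vec<u64>], num_batches: u64) -> bool {
    let mut seen = HashSet::new();
    for shard in shards {
        for &batch_id in shard {
            if !seen.insert(batch_id) {
                return false; // two peers would train on the same batch
            }
        }
    }
    seen.len() as u64 == num_batches && seen.iter().all(|&b| b < num_batches)
}

fn main() {
    // Two 6GB peers (A and B) splitting 10 batches for one round.
    let shards = assign_batches(10, 2);
    println!("peer A: {:?}", shards[0]);
    println!("peer B: {:?}", shards[1]);
    assert!(is_disjoint_cover(&shards, 10));
}
```

Any mapping that sends each batch id to exactly one peer passes the same check. Round-robin simply makes the property hold by construction.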
DisTrO optimizer (the key bandwidth savings)
Wire format per training round:
```rust
// crates/shared/modeling/src/distro.rs
struct SerializedDistroResult {
    sparse_idx: Vec<u32>, // TopK DCT component indices
    sparse_val: Vec<f32>, // TopK DCT component amplitudes
    xshape: Vec<u16>,     // original tensor shape
    totalk: u32,          // total K selected
}

struct TransmittableDistroResult {
    step: u64,
    trainer_nonce: [u8; 32],
    batch_id: u64,
    distro_results: Vec<SerializedDistroResult>,
}
```
Bandwidth reduction: only TopK DCT components transmitted, not full gradients. Integrity: SHA-256 over (step, batch_id, sparse tensors).
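The following is a minimal sketch of the DCT-TopK step. It assumes 1-D gradient chunks and an orthonormal DCT-II evaluated directly. Psyche's distro.rs works on chunked tensors with pregenerated basis matrices and keeps optimizer state, none of which is reproduced here. The helpers `dct`, `idct`, `topk_sparse`, and `decompress` are hypothetical names. The sketch shows why sending only the TopK indices and amplitudes (the `sparse_idx` / `sparse_val` shape above) is enough to approximate a smooth gradient.

```rust
use std::f64::consts::PI;

/// Orthonormal DCT-II basis weight for coefficient `k`, sample `i`, length `n`.
fn basis(k: usize, i: usize, n: usize) -> f64 {
    let scale = if k == 0 { (1.0 / n as f64).sqrt() } else { (2.0 / n as f64).sqrt() };
    scale * (PI / n as f64 * (i as f64 + 0.5) * k as f64).cos()
}

/// Forward DCT-II, evaluated directly in O(n^2) (no cached basis).
fn dct(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n).map(|k| (0..n).map(|i| basis(k, i, n) * x[i]).sum::<f64>()).collect()
}

/// Inverse transform (DCT-III): an orthonormal basis matrix is inverted by its transpose.
fn idct(c: &[f64]) -> Vec<f64> {
    let n = c.len();
    (0..n).map(|i| (0..n).map(|k| basis(k, i, n) * c[k]).sum::<f64>()).collect()
}

/// Keep the `k` largest-magnitude coefficients as (index, amplitude) pairs.
fn topk_sparse(coeffs: &[f64], k: usize) -> (Vec<u32>, Vec<f32>) {
    let mut order: Vec<usize> = (0..coeffs.len()).collect();
    order.sort_by(|&a, &b| coeffs[b].abs().total_cmp(&coeffs[a].abs())); // descending |c|
    order.truncate(k);
    order.sort_unstable(); // ascending indices on the wire
    let idx: Vec<u32> = order.iter().map(|&i| i as u32).collect();
    let val: Vec<f32> = order.iter().map(|&i| coeffs[i] as f32).collect();
    (idx, val)
}

/// Scatter the sparse components into a dense coefficient vector, then invert the DCT.
fn decompress(idx: &[u32], val: &[f32], n: usize) -> Vec<f64> {
    let mut coeffs = vec![0.0f64; n];
    for (&i, &v) in idx.iter().zip(val) {
        coeffs[i as usize] = v as f64;
    }
    idct(&coeffs)
}

fn main() {
    // A smooth 64-element "gradient" chunk: most of its energy sits in low-frequency terms.
    let n = 64;
    let grad: Vec<f64> = (0..n).map(|i| (PI * i as f64 / n as f64).sin()).collect();
    let k = 8;
    let (idx, val) = topk_sparse(&dct(&grad), k);
    let approx = decompress(&idx, &val, n);
    let max_err = grad.iter().zip(&approx).map(|(a, b)| (a - b).abs()).fold(0.0, f64::max);
    // Dense payload: n f32 values. Sparse payload: k (u32 index, f32 value) pairs.
    println!("dense bytes: {}, sparse bytes: {}", n * 4, k * 8);
    println!("kept indices: {idx:?}, max reconstruction error: {max_err:.4}");
}
```

Keeping 8 of 64 coefficients shrinks this chunk from 256 to 64 bytes. That 4x ratio comes only from k/n = 1/8 combined with the 8-byte (index, value) pair. Real savings depend on the chosen K and chunk size.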
Coordinator states
Uninitialized → WaitingForMembers → Warmup → RoundTrain → RoundWitness
→ Cooldown → Finished | Paused
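The lifecycle above can be sketched as a Rust enum with a guarded transition function. The state names come from the list. Several pieces are assumptions for illustration, not Psyche's actual transition rules:
- the guard inputs in the hypothetical `Tick` struct;
- the loop from RoundWitness back to RoundTrain while rounds remain;
- reaching Paused from Cooldown on request.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoordinatorState {
    Uninitialized,
    WaitingForMembers,
    Warmup,
    RoundTrain,
    RoundWitness,
    Cooldown,
    Finished,
    Paused,
}

/// Hypothetical inputs the coordinator consults when deciding whether to advance.
struct Tick {
    enough_members: bool,  // minimum peer count reached
    quorum_reached: bool,  // 2/3 of elected witnesses submitted valid proofs
    rounds_left: bool,     // assumption: more train/witness rounds remain
    pause_requested: bool, // assumption: operator asked to pause
}

impl CoordinatorState {
    /// Returns the next state, or `self` when the guard for advancing is not met.
    fn advance(self, t: &Tick) -> CoordinatorState {
        use CoordinatorState::*;
        match self {
            Uninitialized => WaitingForMembers,
            WaitingForMembers if t.enough_members => Warmup,
            Warmup => RoundTrain,
            RoundTrain => RoundWitness,
            // State only advances past RoundWitness once the witness quorum is met.
            RoundWitness if t.quorum_reached && t.rounds_left => RoundTrain,
            RoundWitness if t.quorum_reached => Cooldown,
            Cooldown if t.pause_requested => Paused,
            Cooldown => Finished,
            other => other, // guard not met, or Finished/Paused: stay put
        }
    }
}

fn main() {
    let tick = Tick { enough_members: true, quorum_reached: true, rounds_left: false, pause_requested: false };
    let mut state = CoordinatorState::Uninitialized;
    while state != CoordinatorState::Finished {
        let next = state.advance(&tick);
        println!("{state:?} -> {next:?}");
        state = next;
    }
}
```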
Dual backend:
- TCP centralized (architectures/centralized/): simpler; trust required
- Solana on-chain (architectures/decentralized/): trustless; rewards via token
Witness quorum (integrity without trust)
- Bloom filter: 1024-bit (16x u64), 1% false-positive rate
- Threshold: 2/3 of elected witnesses must submit valid proofs before state advances
- Max 256 clients, max 32 witnesses per round
- Client states: Healthy | Dropped | Withdrawn | Ejected
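The following sketch covers the two witness-quorum checks above: a 1024-bit Bloom filter stored as 16 u64 words, and the 2/3 threshold. Several choices are assumptions, not Psyche's implementation:
- what a witness inserts (here, opaque byte strings such as a hypothetical per-result commitment);
- the FNV-1a hash;
- the double-hashing scheme;
- whether exactly 2/3 counts as a quorum.

The choice of 7 hash functions follows the textbook sizing rule for a 1% false-positive rate, k = -log2(0.01) ≈ 6.64. At 1024 bits, that rate holds up to roughly m(ln 2)²/ln(1/p) ≈ 107 inserted items.

```rust
const WORDS: usize = 16; // 16 x u64 = 1024 bits
const BITS: u64 = (WORDS * 64) as u64;
const NUM_HASHES: u64 = 7; // assumption: optimal k for a ~1% false-positive rate

/// FNV-1a 64-bit hash: a simple deterministic stand-in, not necessarily Psyche's choice.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

#[derive(Clone, Default)]
struct Bloom1024 {
    bits: [u64; WORDS],
}

impl Bloom1024 {
    /// Probe positions via double hashing: (h1 + i * h2) mod 1024. Forcing h2 odd makes
    /// it coprime with 1024, so the 7 positions for one item are always distinct.
    fn positions(item: &[u8]) -> impl Iterator<Item = u64> {
        let h = fnv1a(item);
        let (h1, h2) = (h & 0xffff_ffff, (h >> 32) | 1);
        (0..NUM_HASHES).map(move |i| (h1 + i * h2) % BITS)
    }

    fn insert(&mut self, item: &[u8]) {
        for p in Self::positions(item) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// May report a false positive (about 1% at design load), never a false negative.
    fn contains(&self, item: &[u8]) -> bool {
        Self::positions(item).all(|p| (self.bits[(p / 64) as usize] >> (p % 64)) & 1 == 1)
    }
}

/// 2/3 threshold in integer arithmetic (no float rounding at the boundary).
/// Assumption: exactly 2/3 of elected witnesses counts as a quorum.
fn quorum_reached(valid_proofs: usize, elected_witnesses: usize) -> bool {
    elected_witnesses > 0 && valid_proofs * 3 >= elected_witnesses * 2
}

fn main() {
    let mut seen = Bloom1024::default();
    seen.insert(b"step=42/batch=7");
    assert!(seen.contains(b"step=42/batch=7"));
    println!("unseen item reported present: {}", seen.contains(b"step=42/batch=8"));
    // 32 witnesses (the per-round maximum): 22 valid proofs reach 2/3, 21 do not.
    println!("{} {}", quorum_reached(22, 32), quorum_reached(21, 32));
}
```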
Experience mesh integration plan
| Psyche component | Experience equivalent | Integration |
|---|---|---|
| Coordinator (TCP) | Experience control plane (127.0.0.1:47800) | Wire coordinator into control plane HTTP API |
| iroh QUIC P2P | Experience mesh P2P transport (libp2p future) | Use iroh as transport for training traffic |
| Solana treasury | Experience economy (XP compute points) | Map Psyche compute_point rewards → XP economy earn |
| Witness quorum | No equivalent | Adopt directly — ensures gradient integrity |
| DisTrO optimizer | No equivalent | Adopt directly — bandwidth reduction is critical on LAN |
Workflow — adding a training peer
```bash
# On each peer (with the Experience mesh installed):
xp mesh training join \
  --coordinator <coordinator-url> \
  --data-shard <path-to-shard> \
  --gpu-memory 6G

# Monitor:
xp mesh training status
xp mesh training metrics   # bandwidth, round progress, witness quorum state
```
Limitations (from RE)
- Max 256 clients, max 32 witnesses — sufficient for LAN / small internet mesh
- Solana backend requires Solana account + rent; TCP backend is simpler for local
- Windows: the Solana sandbox is weaker here (no user namespaces); use TCP mode
- DisTrO DCT basis matrices are pregenerated at startup, so first-round latency is higher
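The client and witness caps can be checked at join time, before a request reaches either coordinator backend. What follows is a minimal sketch. `admit`, `JoinError`, and `witness_count` are hypothetical names, and peer ids are modeled as plain u64. Only the two limits (256 clients, 32 witnesses) come from the text above.

```rust
use std::fmt;

const MAX_CLIENTS: usize = 256; // hard cap in the Solana program
const MAX_WITNESSES: usize = 32; // per-round witness cap

#[derive(Debug, PartialEq)]
enum JoinError {
    RunFull { max: usize },
    AlreadyJoined,
}

impl fmt::Display for JoinError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            JoinError::RunFull { max } => write!(f, "run is full ({max} clients max)"),
            JoinError::AlreadyJoined => write!(f, "peer already joined"),
        }
    }
}

/// Hypothetical admission check: reject duplicates first, then enforce the client cap.
fn admit(clients: &mut Vec<u64>, peer_id: u64) -> Result<(), JoinError> {
    if clients.contains(&peer_id) {
        return Err(JoinError::AlreadyJoined);
    }
    if clients.len() >= MAX_CLIENTS {
        return Err(JoinError::RunFull { max: MAX_CLIENTS });
    }
    clients.push(peer_id);
    Ok(())
}

/// Number of witnesses to elect in a round: never more than the cap.
fn witness_count(healthy_clients: usize) -> usize {
    healthy_clients.min(MAX_WITNESSES)
}

fn main() {
    let mut clients = Vec::new();
    for id in 0..MAX_CLIENTS as u64 {
        admit(&mut clients, id).expect("below the cap");
    }
    match admit(&mut clients, 9999) {
        Err(e) => println!("rejected: {e}"),
        Ok(()) => println!("admitted"),
    }
    println!("witnesses for 2 peers: {}, for 256: {}", witness_count(2), witness_count(256));
}
```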
Anti-patterns
- Do NOT run both coordinator backends simultaneously — choose TCP (local/trusted) OR Solana (internet/trustless)
- Do NOT exceed 256 clients without forking — the Solana program has this hard cap
- Do NOT use Python wrappers — Psyche's Rust core is the right integration point; Python client examples are reference only
- Do NOT skip the witness quorum in distributed mode: integrity guarantees collapse without it (TCP mode is acceptable for trusted-peer LAN training)
Provenance
- Source: PsycheFoundation/nousnet (Apache 2.0): .planning/ingest/2026-06-07-article-github-com-psychefoundation-nousnet.md
- Tier: foundation / strategic
- Adoption type: wrap + integrate; Apache 2.0; direct use allowed
- Relationship to existing work: nousnet ingest already in memory as project_nousnet_remote_slm_training_plugin.md
- Built: 2026-06-07