Coordinator-free distributed ML training over Nostr relays.
Workers exchange compressed, Schnorr-signed pseudo-gradients through public WebSocket relays.
No central server. No custom infrastructure. Just DiLoCo over Nostr.
4 workers exchanging gradients through a local relay — live round-by-round progress
All workers converge to the same model — 97-99% loss reduction across data shards
```mermaid
sequenceDiagram
    participant A as Worker A
    participant R as Nostr Relay(s)
    participant B as Worker B
    Note over A,B: All workers start from shared model snapshot
    loop Each Round
        A->>A: Local SGD (N inner steps)
        B->>B: Local SGD (N inner steps)
        A->>A: delta = params - initial
        B->>B: delta = params - initial
        A->>A: top-k → int8 → zlib → BIP340 sign
        B->>B: top-k → int8 → zlib → BIP340 sign
        A->>R: EVENT kind:33333 (gradient)
        B->>R: EVENT kind:33333 (gradient)
        R-->>A: Peer gradients
        R-->>B: Peer gradients
        A->>A: aggregate(weighted mean) → Nesterov outer step
        B->>B: aggregate(weighted mean) → Nesterov outer step
    end
```
1. Install

```bash
python -m pip install -e .
```

2. Initialize a model

```bash
nostrain init-state --runtime linear-regression --features 3 -o model.json
```

3. Train across relays

```bash
# Run this on each machine with different data shards
nostrain run-training model.json data.json \
  --relay wss://relay.damus.io \
  --relay wss://nos.lol \
  --run my-experiment \
  --sec-key $NOSTR_SECRET_KEY \
  --rounds 5 \
  --inner-steps 80 \
  --outer-learning-rate 0.7 \
  --momentum 0.9 \
  -o trained.json
```

Workers discover each other via heartbeat events and sync automatically.
The repo ships a complete demo that trains a character-level GPT on Shakespeare. Four workers each get a different slice of the text. No single worker sees the full corpus — they have to collaborate through the relay to learn the language.
```bash
python -m pip install -e ".[torch]"
bash demo/gpt/run.sh
```

Or run headless (no tmux, everything in-process):

```bash
PYTHONPATH=. python demo/gpt/train.py --rounds 5 --inner-steps 100
```

The model in `demo/gpt/model.py` is a standard decoder-only transformer — the same architecture behind GPT-2, GPT-3, and every modern LLM, just smaller. Here's what's inside:
```
CharGPT (834K params)
├── tok_emb   Embedding(96, 128)   — maps each character to a 128-dim vector
├── pos_emb   Embedding(128, 128)  — learned position encodings (context = 128 chars)
├── blocks    4 × TransformerBlock
│   ├── ln_1  LayerNorm            — normalize before attention
│   ├── attn  CausalSelfAttention  — 4 heads, each 32-dim, masked so tokens
│   │   │                            can't attend to future positions
│   │   ├── c_attn  Linear(128, 384) — project to Q, K, V in one shot
│   │   └── c_proj  Linear(128, 128) — project attention output back
│   ├── ln_2  LayerNorm            — normalize before FFN
│   └── mlp   FFN                  — Linear(128, 512) → GELU → Linear(512, 128)
├── ln_f      LayerNorm            — final norm
└── head      Linear(128, 96)      — predict next character (96 = printable ASCII)
```
The input is a sequence of characters. Each character becomes a 128-dimensional embedding, gets a position encoding added, then flows through 4 transformer blocks. Each block runs causal self-attention (every token can only look at tokens before it — this is what makes it autoregressive) followed by a feed-forward network. The output is a probability distribution over the next character.
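For intuition, causal masking is just a lower-triangular mask over the attention scores. A minimal PyTorch sketch (illustrative, not the demo's code):

```python
import torch

T = 8                                     # sequence length
scores = torch.randn(T, T)                # raw attention scores (illustrative)
mask = torch.tril(torch.ones(T, T))       # lower triangle: token i sees tokens <= i
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)   # future positions get zero attention weight
```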
96 vocabulary tokens (printable ASCII), 128-dimensional embeddings, 4 layers, 4 attention heads, 128 context length. Total: 834,048 parameters. Small enough to train on a CPU in minutes, large enough to learn non-trivial language structure.
This is DiLoCo (Distributed Low-Communication training) — the same algorithm Google used to train language models across poorly connected datacenters. The idea is beautifully simple:
Inner loop (local). Each worker trains independently on its own data shard using standard AdamW. This is regular gradient descent — nothing special. Each worker runs 100 steps, sees different text, learns different things.
Pseudo-gradient. After local training, each worker computes delta = trained_params - initial_params. This is the "pseudo-gradient" — a summary of what the worker learned in that round. Unlike real gradients (which are instantaneous slopes), pseudo-gradients capture the cumulative effect of many training steps.
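In code, the round-level delta is just a subtraction over flat parameter vectors (a sketch; the inner loop is elided):

```python
import numpy as np

params = np.zeros(834_048)            # flat parameter vector (size from the demo)
initial = params.copy()               # snapshot taken at the start of the round
# ... run the inner loop here (e.g. 100 AdamW steps mutating `params`) ...
pseudo_gradient = params - initial    # cumulative change, not an instantaneous slope
```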
Compression. The pseudo-gradient has 834K float values. We compress it:
- Top-k sparsification (keep only the 30% largest values by magnitude)
- int8 quantization (scale to [-127, 127])
- zlib compression
Result: ~580KB per gradient event. This gets published as a Nostr event.
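A compact sketch of those three stages in NumPy (the real NSTR wire format adds a magic header and its own sparse-index layout):

```python
import zlib
import numpy as np

def compress(delta: np.ndarray, topk_ratio: float = 0.3) -> bytes:
    """Sketch of the three compression stages; not the library's code."""
    k = max(1, int(delta.size * topk_ratio))
    idx = np.argpartition(np.abs(delta), -k)[-k:]        # top-k by magnitude
    vals = delta[idx]
    scale = float(np.abs(vals).max()) or 1.0             # avoid divide-by-zero
    q = np.round(vals / scale * 127).astype(np.int8)     # int8 quantization
    raw = idx.astype(np.uint32).tobytes() + q.tobytes()  # indices + values
    return zlib.compress(raw)                            # final byte stream
```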
Transport. The compressed pseudo-gradient is packed into a NIP-01 event (kind 33333), signed with a BIP340 Schnorr signature, and published to the relay. Gradient and heartbeat events can also advertise the local example count so peers can weight uneven shards correctly. All standard Nostr — any relay works.
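For orientation, such an event has the standard NIP-01 shape. Field values below are placeholders, and any tag beyond the `t` tag is illustrative:

```python
event = {
    "kind": 33333,
    "pubkey": "<worker pubkey, hex>",
    "created_at": 1700000000,
    "tags": [["t", "my-experiment"]],           # relays index on kind + #t
    "content": "<base64-encoded NSTR payload>",
    "id": "<sha256 of the serialized event>",
    "sig": "<BIP340 Schnorr signature>",
}
```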
Outer loop (aggregation). Each worker subscribes to the relay, collects all peer gradients for the current round, decompresses them, and computes an example-weighted mean when peers advertise example counts, falling back to a plain per-worker mean otherwise. This aggregated pseudo-gradient is applied with Nesterov momentum — an update that looks ahead along the momentum direction before stepping, converging faster than plain SGD.
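A minimal sketch of that outer update over flat NumPy vectors; sign conventions follow PyTorch's Nesterov SGD, and the library's `nesterov_outer_step` may differ in detail:

```python
import numpy as np

def outer_step(params, agg_delta, velocity, lr=0.7, momentum=0.9):
    """Sketch of a Nesterov-momentum outer step. `agg_delta` is the
    aggregated pseudo-gradient, which already points toward the
    locally trained weights, so we step along +agg_delta."""
    velocity = momentum * velocity + agg_delta    # accumulate momentum
    lookahead = agg_delta + momentum * velocity   # Nesterov look-ahead
    return params + lr * lookahead, velocity
```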
Then the cycle repeats.
The text evolves from random garbage to recognizable English over 5 rounds:
```
Round 0 (random init):
ROMEO:2NYPp@JTb<;2..qce[vP[qIto9TxFwIHb)D~?>o9[**c!$/?Z"yiFxy

Round 1 (after first sync):
ROMEO: Maf A1Exs w molinounan, sthat pine ted mes I chat, y hethanalher

Round 3:
ROMEO: Whe do he sthe senond pare at o pro ther fakis clotinthont se path

Round 5:
ROMEO: d I shes mear to the ce withat, tre so ther wisho sheath. do wiso
me ifon ithid An w'd onke r wour sple y thenoreancpe ay ak, hire d hie
```
Loss drops from ~4.5 to ~2.4. The model starts recognizing common English patterns — "the", "that", "ther", "and" — and begins forming word-like structures. It's not Shakespeare yet (you'd need more parameters and more training), but the trajectory is clear: the four workers, each seeing only 25% of the text, collaboratively learn the structure of the language by exchanging compressed updates through a Nostr relay.
Each round in the demo executes this exact sequence of nostrain operations:
```python
# 1. Publish heartbeat so peers can discover us
heartbeat = build_heartbeat_event(metadata, secret_key_hex=key)
await publish_nostrain_events(relay_urls, heartbeat)

# 2. Train locally with PyTorch (standard autograd)
optimizer = AdamW(model.parameters(), lr=3e-4)
for step in range(100):
    x, y = dataset.get_batch(32)
    loss = model(x, y)
    optimizer.zero_grad()   # reset gradients each step
    loss.backward()
    optimizer.step()

# 3. Compute pseudo-gradient: what did local training change?
trained_state = model_state_from_module(model)  # nn.Module → ModelState
delta = compute_delta(initial_state, trained_state)

# 4. Compress and publish as signed Nostr event
payload = compress_delta(delta, topk_ratio=0.3)  # 834K → 250K values, int8
event = build_gradient_event(payload, metadata, secret_key_hex=key)
await publish_nostrain_events(relay_urls, event)  # → relay

# 5. Collect peer gradients from relay
collection = await collect_gradient_events_across_relays(
    relay_urls, run_name=run, round_index=round_idx,
    idle_timeout=12.0, strategy="timeout", discover_workers=True,
)

# 6. Aggregate and apply outer step
outer = nesterov_outer_step(
    initial_state, collection.aggregate_delta(),
    learning_rate=0.7, momentum=0.9,
)

# 7. Load updated state back into PyTorch model
load_state_into_module(outer.next_state, model)
```

The PyTorch training (step 2) is completely standard — nostrain doesn't touch your model architecture, optimizer, or loss function. It only cares about the state before and after. Everything between model_state_from_module and load_state_into_module is framework-agnostic: the compression, signing, relay transport, aggregation, and outer step all operate on flat parameter tensors. You could swap PyTorch for MLX, JAX, or a pure-numpy implementation and the transport layer wouldn't change.
If you want something faster to verify the setup, there's also a linear regression demo:
```bash
bash demo/run.sh
```

4 workers learn y = 3x₁ - 1.5x₂ + 0.5x₃ + 1 from non-overlapping data shards. It takes about 60 seconds, and all workers converge to the true weights. Same DiLoCo loop, much smaller model (4 parameters).
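For context, a hypothetical way to build such non-overlapping shards (`worker_index`, shapes, and seeds are illustrative, not the demo's code):

```python
import numpy as np

worker_index = 0                                # 0..3, one per machine (illustrative)
rng = np.random.default_rng(seed=worker_index)  # different seed → disjoint shard
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X @ np.array([3.0, -1.5, 0.5]) + 1.0        # y = 3*x1 - 1.5*x2 + 0.5*x3 + 1
```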
```
pseudo_gradient = params - initial
        │
        ▼
top-k sparsification ─── keep k% largest values
        │
        ▼
int8 quantization ────── scale to [-127, 127]
        │
        ▼
NSTR wire format ─────── magic + sparse index layout
        │
        ▼
zlib/zstd ────────────── compressed bytes
        │
        ▼
base64 → Nostr event content
```
A 10k-parameter gradient at topk=0.1 compresses to ~1KB.
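A back-of-envelope check of that figure, assuming roughly two bytes per sparse index (the exact NSTR layout differs):

```python
params, topk = 10_000, 0.1
kept = int(params * topk)          # 1,000 surviving values
values_bytes = kept                # int8: one byte per value
index_bytes = kept * 2             # assumed ~2 bytes per sparse index
print(values_bytes + index_bytes)  # ≈ 3 KB pre-compression; zlib brings it near 1 KB
```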
Three NIP-01 event kinds, all BIP340 Schnorr signed:
| Kind | Type | Content |
|---|---|---|
| `33333` | Gradient | Compressed pseudo-gradient payload (base64) |
| `33334` | Heartbeat | Empty — capabilities, relay hints, and optional example counts in tags |
| `33335` | Checkpoint | Serialized training state for recovery |
Works with any relay that indexes on kind and #t tags.
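Concretely, collecting a run's gradients is an ordinary NIP-01 subscription; the run tag value below is illustrative:

```python
import json

# Standard NIP-01 subscription filter for one run's gradient events
req = ["REQ", "sub-1", {"kinds": [33333], "#t": ["my-experiment"]}]
print(json.dumps(req))
```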
Gradient and heartbeat tags may include `examples=<count>`. When present on gradients, nostrain uses those counts for weighted aggregation across uneven worker shards.
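A sketch of that weighting rule over stacked flat deltas (not the library's implementation):

```python
import numpy as np

def aggregate(deltas, examples=None):
    """Example-weighted mean of peer pseudo-gradients, falling back
    to a plain mean when example counts are missing."""
    stacked = np.stack(deltas)                        # (workers, params)
    if examples and all(n for n in examples):
        w = np.asarray(examples, dtype=float)
        return (w[:, None] * stacked).sum(0) / w.sum()
    return stacked.mean(0)                            # unweighted fallback
```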
| Flag | Default | Description |
|---|---|---|
| `--relay` | required | WebSocket relay URL (repeatable) |
| `--run` | required | Shared run name across workers |
| `--sec-key` | required | Hex Nostr secret key |
| `--rounds` | `1` | Number of outer rounds |
| `--inner-steps` | `500` | Local SGD steps per round |
| `--local-learning-rate` | `0.01` | Inner loop learning rate |
| `--outer-learning-rate` | `0.7` | DiLoCo outer step learning rate |
| `--momentum` | `0.9` | Nesterov outer momentum |
| `--batch-size` | `1` | Mini-batch size |
| `--topk` | `1.0` | Gradient sparsity (0.1 = keep 10%) |
| `--round-timeout` | `2.0` | Seconds to wait for peer gradients |
| `--backend` | `python` | `python`, `numpy`, or `torch` |
| `--resume-latest-checkpoint` | — | Rejoin from relay-distributed checkpoint |
| Strategy | Behavior |
|---|---|
| `timeout` | Aggregate whatever arrives within N seconds |
| `strict` | Wait for exactly N workers |
| `quorum` | Wait for majority of discovered workers |
| `async` | Return immediately with local gradient only |
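Each strategy plugs into the same collection call used in step 5 of the round walkthrough, e.g. waiting for a majority instead of a fixed window:

```python
# Same call as step 5 above, but blocking until a majority of
# discovered workers have published (strategy="quorum"):
collection = await collect_gradient_events_across_relays(
    relay_urls, run_name=run, round_index=round_idx,
    idle_timeout=12.0, strategy="quorum", discover_workers=True,
)
```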
- Multi-relay — publish to N relays, collect from all, deduplicate by event fingerprint
- Retry + backoff — configurable exponential backoff on transient failures
- Late gradients — fold into next round (`deferred`) or record-only (`discard`)
- Checkpoint recovery — resume from local file or discover latest from relay
- Rolling retention — bound relay-visible checkpoint history per worker
```bash
nostrain convert-state model.json -o model.npz     # NumPy archive
nostrain convert-state model.json -o model.pt.npz  # PyTorch state-dict archive
nostrain convert-state model.json -o model.pt      # Native torch.save
```

PyTorch import auto-handles `module.*` prefixes, `state_dict`/`model_state_dict` wrappers, and nested checkpoint bundles.
```
nostrain init-state             Initialize model state for a built-in runtime
nostrain hash-state             Deterministic model hash
nostrain convert-state          Convert between JSON / npz / pt formats
nostrain encode-delta           Compress a pseudo-gradient
nostrain decode-payload         Decompress a payload
nostrain apply-payload          Reconstruct state from base + payload
nostrain aggregate-payloads     Aggregate multiple worker payloads
nostrain outer-step             Apply DiLoCo outer step with momentum
nostrain train-local            Run inner SGD loop locally
nostrain build-event            Build signed gradient event
nostrain build-heartbeat        Build signed heartbeat event
nostrain build-checkpoint       Build signed checkpoint event
nostrain inspect-event          Validate and inspect an event
nostrain publish-event          Publish to relay(s)
nostrain collect-events         Collect round events from relay(s)
nostrain aggregate-round        Collect + aggregate in one step
nostrain discover-workers       List active workers
nostrain discover-checkpoints   Find latest checkpoint
nostrain derive-pubkey          Derive pubkey from secret key
nostrain run-training           Full distributed training session
```
```python
from nostrain import (
    GradientEventMetadata,
    build_gradient_event,
    compress_delta,
    compute_delta,
    state_digest,
)

# Compress a pseudo-gradient
delta = compute_delta(initial_state, trained_state)
payload = compress_delta(delta, topk_ratio=0.1)

# Publish as a signed Nostr event
metadata = GradientEventMetadata(
    run_name="experiment-1",
    round_index=0,
    worker_id=worker_pubkey,
    model_hash=state_digest(initial_state),
    inner_steps=100,
)
event = build_gradient_event(
    metadata,
    payload,
    secret_key_hex=secret_key_hex,
)
```

| Extra | Package | Enables |
|---|---|---|
| `numpy` | `numpy>=1.26` | `.npz` state I/O, NumPy training backend |
| `torch` | `torch>=2.1` | `.pt`/`.pth` checkpoints, torch training backend |
| `zstd` | `zstandard>=0.22` | zstd compression (default: zlib) |
```bash
python -m pip install -e ".[numpy,torch,zstd]"
```

Why Nostr? Public relay infrastructure already exists — WebSocket pub/sub at scale with cryptographic identity built in. Zero servers to deploy.
Why pure-Python crypto? BIP340 Schnorr signatures using only hashlib. No compiled extensions, installs everywhere.
Why framework-agnostic transport? The wire protocol never imports torch or numpy. Framework code lives at the edges and is entirely optional.
- Architecture and operational notes: `docs/ARCHITECTURE.md`
- Contributor workflow: `CONTRIBUTING.md`
- Changelog: `CHANGELOG.md`
- Agent context: `AGENTS.md`
```bash
python -m pip install -e ".[dev,numpy]"
make lint
make test
make coverage
```

Relay changes are the slowest path in the repo. Run this suite before shipping transport or checkpoint changes:

```bash
python -m pytest tests/test_relay.py -q
```

This project is licensed under the MIT License. See LICENSE.