Scripts and scaffolding to fine-tune modern models with Hugging Face Transformers and optional LoRA adapters, and to deploy them behind a simple HTTP API.
Recommended baselines:

- Open + small (plug-and-play): `EleutherAI/pythia-410m-deduped` (or `EleutherAI/pythia-1.4b-deduped`)
- Bigger (requires access): `meta-llama/Llama-2-13b-hf` (higher quality, heavier)
Contents:

- Project Structure
- Quickstart
- Distributed Training Smoke Test (LoRA Delta Path)
- Deployment API
- Command Station (GUI)
- Notes
## Project Structure

```
.
├── configs/            # YAML training/eval hyperparameters
├── data/               # Data prep + dataset utilities
│   ├── raw/            # (gitignored)
│   ├── processed/      # (gitignored)
│   └── lib/            # Dataset utilities (e.g., Wikipedia downloader)
├── scripts/            # Runnable entrypoints
│   ├── train.py
│   ├── evaluate.py
│   ├── command_station.py
│   ├── distributed_train.py
│   ├── hyperparameter_tuning.py
│   └── train_json.py
├── src/                # Importable library code
│   ├── models/         # Base model/tokenizer loading
│   ├── inference/      # Generation helpers
│   ├── cc/             # Command-and-control server
│   ├── node/           # Worker client
│   └── utils/          # Config helpers
├── deployment/         # Flask API server + Dockerfile
├── trainer/            # Distributed training scaffold (trainer loop)
├── policy_service/     # Policy service stub
├── actors/             # Actor client / worker / logger
├── npc/                # NPC agent loops (sense→think→act)
├── examples/           # Educational transformer implementation
├── openweb/            # Open WebUI + Ollama compose
├── results/            # (gitignored) training artifacts
├── models/             # (gitignored) checkpoints + final weights + adapters
└── logs/               # (gitignored) runs / metrics
```
## Quickstart

Install dependencies:

```bash
pip install -r requirements.txt
```

Recommended for large artifacts (datasets, checkpoints, logs):

```bash
python data/link_external.py --external-root /mnt/SSD1TB/ZYKE_DATA
```

This links `data/raw`, `data/processed`, `models`, `results`, and `logs` → your mount.
Build a dataset:

```bash
python data/build_dataset.py \
  --sources openwebtext,wikipedia \
  --output-dir data/processed \
  --max-total 100000 \
  --weights openwebtext=1,wikipedia=1 \
  --local-dir data/raw/ebooks \
  --lang en \
  --use-minhash \
  --minhash-threshold 0.8
```

Prepare the data:

```bash
python data/prepare_data.py
```

Build experience blocks:

```bash
python data/make_experience_blocks.py \
  --input-path data/processed \
  --output data/processed/experience_blocks.jsonl \
  --tokenizer EleutherAI/pythia-410m-deduped \
  --seq-len 128 \
  --steps-per-block 64 \
  --env-id ebooks \
  --npc-type generic
```

Smoke-test training on starter blocks:

```bash
python scripts/train.py --starter-blocks data/starter_blocks --resume-latest --max-steps 50
```

Train on a text corpus (from `data/build_dataset.py`):
```bash
python scripts/train.py \
  --base-model EleutherAI/pythia-410m-deduped \
  --train-txt data/processed/train.txt \
  --val-txt data/processed/val.txt \
  --resume-latest
```

Train with LoRA adapters:

```bash
python scripts/train.py \
  --base-model EleutherAI/pythia-410m-deduped \
  --starter-blocks data/starter_blocks \
  --use-lora \
  --adapter-name npc_core_pythia_410m_v1 \
  --update-manifest \
  --resume-latest
```

Configure targets via `--lora-target-modules`, as in the sketch below.
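Which module names to target depends on the base architecture; the flag is from this repo, but the module names and list syntax below are assumptions based on common PEFT conventions (GPT-NeoX models such as Pythia expose `query_key_value`; LLaMA-style models expose `q_proj`/`k_proj`/`v_proj`/`o_proj`):

```bash
# Pythia (GPT-NeoX) attention projection (assumed module name)
python scripts/train.py \
  --base-model EleutherAI/pythia-410m-deduped \
  --use-lora \
  --lora-target-modules query_key_value

# LLaMA-style attention projections (assumed module names)
python scripts/train.py \
  --base-model meta-llama/Llama-2-13b-hf \
  --use-lora \
  --lora-target-modules q_proj,k_proj,v_proj,o_proj
```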
Logging options:

- TensorBoard: `--log-to tensorboard` (logs under `logs/training`)
- Weights & Biases: `--log-to wandb` (requires `WANDB_PROJECT` + token; see the example below)
- Hugging Face Hub push: `--hf-push`
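For example, a W&B-logged run might look like this (`--log-to wandb` and `WANDB_PROJECT` come from the list above; the project name and token setup are placeholders):

```bash
export WANDB_API_KEY=<your-token>   # or authenticate once with `wandb login`
WANDB_PROJECT=npc-finetune python scripts/train.py \
  --base-model EleutherAI/pythia-410m-deduped \
  --train-txt data/processed/train.txt \
  --log-to wandb
```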
Note: LLaMA weights require Hugging Face access + token.
Evaluate:

```bash
python scripts/evaluate.py \
  --evals wikitext2,wikitext103,c4,lambada,piqa,hellaswag,winogrande,arc
```

Evaluate with an adapter:

```bash
python scripts/evaluate.py \
  --evals wikitext2,lambada \
  --adapter-path models/adapters/<name> \
  --base-model meta-llama/Llama-2-13b-hf
```

Or use:

```bash
python scripts/evaluate.py \
  --evals wikitext2,lambada \
  --adapter-name <name>
```

`--adapter-name` is resolved via `data/adapters/manifest.json` (hypothetical entry sketched below).
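The manifest's actual schema lives in the repo; purely as an illustration, an entry mapping an adapter name to its path and base model could look like this (every field name here is hypothetical):

```json
{
  "npc_core_pythia_410m_v1": {
    "path": "models/adapters/npc_core_pythia_410m_v1",
    "base_model": "EleutherAI/pythia-410m-deduped"
  }
}
```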
Python:

```python
from src.inference.generator import generate_npc_response

generate_npc_response(
    ...,
    adapter_path="models/adapters/<name>",
    base_model="meta-llama/Llama-2-13b-hf",
)
```

Or use `adapter_name` (resolved via `data/adapters/manifest.json`).

Important: keep the base model consistent with what the adapter was trained on.
CLI:

```bash
python -m src.inference.generator
```

- Safe-mode is on by default; set `safe_mode=False` for raw output.
- Optional quantization: `quantization="4bit"` or `"8bit"`. (See the sketch after this list.)
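Putting those options together, a hedged sketch (`adapter_name`, `safe_mode`, and `quantization` come from the notes above; the remaining keyword arguments are assumptions modeled on the `/generate` JSON fields documented below):

```python
from src.inference.generator import generate_npc_response

# persona/context/state/player_input mirror the /generate API fields
# and are *assumed* to be accepted here as keyword arguments.
response = generate_npc_response(
    persona="gruff blacksmith",
    context="village forge at dusk",
    state="irritated",
    player_input="Can you repair my sword?",
    adapter_name="npc_core_pythia_410m_v1",  # resolved via data/adapters/manifest.json
    safe_mode=True,                          # default; False yields raw output
    quantization="4bit",                     # optional: "4bit" or "8bit"
)
print(response)
```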
Prepare NPC schema data:

```bash
python data/prepare_npc_schema.py \
  --input data/npc_sample.jsonl \
  --output-dir data/processed
```

Run the API:

```bash
API_TOKEN=yourtoken python deployment/app.py
```

`/generate` supports the following (example request below):

- `adapter_name` (preferred; resolved via `data/adapters/manifest.json`)
- `adapter_version` (optional cache-buster; use this when you update adapter weights on disk and want inference to reload)
- MCP tools (optional): set `enable_tools=true` and pass `npc_type` so the server can allowlist tools from `data/mcp/tools.json`
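A sketch of a request using these fields (the port and `Authorization` header scheme are assumptions; check `deployment/app.py` for how `API_TOKEN` is actually validated):

```bash
curl -X POST http://localhost:5000/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer yourtoken' \
  -d '{
        "persona": "gruff blacksmith",
        "context": "village forge at dusk",
        "state": "irritated",
        "player_input": "Can you repair my sword?",
        "adapter_name": "npc_core_pythia_410m_v1",
        "adapter_version": "2"
      }'
```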
Trainer:

```bash
python -m trainer.server
```

Trainer persistence + ops:

- SQLite DB: `TRAINER_DB_PATH=models/trainer.db`
- Live logs (SSE): `GET /events` (streaming example below)
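Since `/events` is server-sent events, you can watch it from a terminal; `-N` disables curl's buffering, and the port matches the worker examples below:

```bash
curl -N http://localhost:5001/events
```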
Worker(s):

```bash
python -m actors.worker --trainer-url http://localhost:5001 --node-id local
```

To run the LLM LoRA-delta backend (small baseline recommended):

```bash
TRAINER_BACKEND=llm LLM_ADAPTER_NAME=npc_core_pythia_410m_v1 python -m trainer.server
python -m actors.worker --trainer-url http://localhost:5001 --node-id local --mode llm --adapter-name npc_core_pythia_410m_v1
```

To enqueue blocks (generated by `data/make_experience_blocks.py`) into the trainer queue:

```bash
curl -X POST http://localhost:5001/enqueue_blocks \
  -H 'Content-Type: application/json' \
  -d '{"dataset_label":"local","blocks_path":"data/processed/experience_blocks.jsonl","target_adapter":"npc_core_pythia_410m_v1","replicas":5,"best_k":1}'
```

## Command Station (GUI)

```bash
python scripts/command_station.py
```

The GUI can:
- Start/stop the local worker (`mlp` or `llm` mode)
- Build blocks locally and enqueue them to the trainer
- Stream trainer logs live from `GET /events`
- Chat with the inference API; optionally calls `POST /export_adapter` on the trainer first and passes `adapter_version` to force inference reload
- (Optional) Tool-use planning: set `enable_tools` in your `/generate` payload and run a local MCP tool server (see below)
Run the local stub server (replace tool handlers with real game state later):

```bash
python -m src.mcp.server_stub
```

The tool manifest lives at `data/mcp/tools.json`. Add/allowlist tools by `npc_type` (e.g., `dog_guard`).
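For orientation only, a manifest entry could pair a tool definition with an `npc_type` allowlist; every field name below is hypothetical, so check `data/mcp/tools.json` for the real schema:

```json
{
  "tools": [
    {
      "name": "bark",
      "description": "Emit a warning bark at the current target.",
      "allowed_npc_types": ["dog_guard"]
    }
  ]
}
```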
Manifest path: `data/adapters/manifest.json`

Publish an adapter:

```bash
python scripts/publish_adapter.py \
  --adapter-name <name> \
  --manifest data/adapters/manifest.json \
  --repo-id <user/repo>
```

Merge an adapter into base weights:

```bash
python scripts/merge_adapters.py \
  --base-model ... \
  --adapter-path ... \
  --output models/merged/<name>
```

If you import from `src/` in your own scripts, set:
```bash
export PYTHONPATH=.
```

## Distributed Training Smoke Test (LoRA Delta Path)

Runs aggregation + timeout ticker:

```bash
python -m trainer.server
```

Environment knobs (env vars; see the launch example after this list):

- `NUM_TASKS_PER_ROUND` (default `3`)
- `MIN_UPDATES_PER_ROUND` (default `1`)
- `ROUND_TIMEOUT_SEC` (default `30`)
- `MAX_STALENESS` (default `1`)
- `DELTA_NORM_MAX` (default `1e9`)
- `TICK_INTERVAL_SEC` (default `1`)
- `CHECKPOINT_DIR` (default `models/checkpoints`)
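For instance, to run shorter rounds with a tighter delta cap (the variable names come from the list above; the values here are arbitrary):

```bash
NUM_TASKS_PER_ROUND=2 \
ROUND_TIMEOUT_SEC=15 \
DELTA_NORM_MAX=1e6 \
CHECKPOINT_DIR=models/checkpoints \
python -m trainer.server
```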
Workers compute real deltas and submit updates:

```bash
python actors/worker.py --trainer-url http://localhost:5001 --num-tasks 3
```

Worker flow (protocol sketch below):

1. `GET /get_task`
2. `GET /get_lora_weights`
3. Local train loop
4. Save fp16 delta (`torch.save`)
5. `POST /submit_update` with metrics: `train_loss_mean`/`train_loss_last`, `grad_norm_mean`, `steps`, `duration`, `num_samples`

After `NUM_TASKS_PER_ROUND` updates (or timeout + `MIN_UPDATES_PER_ROUND`):

- Aggregation runs.
- `policy_version` increments.
- A new checkpoint is saved under `models/checkpoints`.
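A minimal sketch of that flow over HTTP, assuming JSON request/response bodies; the response field names such as `task_id` are assumptions, and since the source doesn't specify how the saved delta file is attached to the POST, this sketch only sends the metrics:

```python
import requests
import torch

TRAINER = "http://localhost:5001"

task = requests.get(f"{TRAINER}/get_task").json()            # step 1
weights = requests.get(f"{TRAINER}/get_lora_weights").json() # step 2

# step 3: local train loop producing a LoRA delta (omitted here)
delta = {"lora_A": torch.zeros(8, 512, dtype=torch.float16)}  # placeholder fp16 delta
torch.save(delta, "delta.pt")                                 # step 4

metrics = {  # metric names from the flow above
    "train_loss_mean": 2.31,
    "train_loss_last": 2.05,
    "grad_norm_mean": 0.7,
    "steps": 64,
    "duration": 12.4,
    "num_samples": 64,
}
requests.post(                                                # step 5
    f"{TRAINER}/submit_update",
    json={"task_id": task.get("task_id"), "metrics": metrics},
)
```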
Standalone trainer loop:

```bash
python trainer/trainer_loop.py
```

This can run standalone PPO on locally pulled experience blocks (the polling endpoint is a placeholder in the args).
## Deployment API

Run:

```bash
API_TOKEN=yourtoken python deployment/app.py
```

Endpoints:

- `GET /health`
- `GET /metrics`
- `POST /generate`

JSON fields:

- Core: `persona`, `context`, `state`, `player_input`
- Optional model selection: `adapter_path`, `adapter_name`, `base_model`
- Generation settings: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `num_beams`
- Behavior: `safe_mode`, `quantization`
- Batching:

```json
{
  "requests": [
    {
      "persona": "...",
      "context": "...",
      "state": "...",
      "player_input": "..."
    }
  ]
}
```

Batch-level flags apply to the whole batch (client sketch below).
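A hedged Python client for a batched call; the port and auth header scheme are the same assumptions as in the curl example above:

```python
import requests

payload = {
    # batch-level flags apply to every request in the batch
    "safe_mode": True,
    "max_new_tokens": 64,
    "requests": [
        {"persona": "...", "context": "...", "state": "...", "player_input": "Hello!"},
        {"persona": "...", "context": "...", "state": "...", "player_input": "Any rumors?"},
    ],
}
resp = requests.post(
    "http://localhost:5000/generate",
    json=payload,
    headers={"Authorization": "Bearer yourtoken"},  # assumed header scheme
    timeout=60,
)
print(resp.json())
```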
## Notes

- License: CC BY-NC 4.0 (non-commercial). See `LICENSE`.
- Large datasets/checkpoints/logs are gitignored; use external storage for big artifacts.
- Deployment is a simple Flask example; swap to FastAPI (or your stack) as needed.

Logging:

- `--log-to tensorboard` → logs under `logs/training`
- `--log-to wandb` → requires `WANDB_PROJECT` + token
- Hub publish with `--hf-push`
LoRA:

- Train with `--use-lora`; choose a name via `--adapter-name`
- Configure targets via `--lora-target-modules`
- Keep the base model consistent at train/eval/inference
Inference:

- Safe-mode filtering for NPC JSON outputs (toggleable)
- Optional 4/8-bit quantization
- Adapter loading for modular skills
Trainer robustness:

- Rejects stale versions, bad shapes, NaN/inf metrics, and deltas above `DELTA_NORM_MAX`
- Ticker enforces `ROUND_TIMEOUT_SEC` aggregation
Guardrails:

- Generator retries JSON parsing with temperature/top-p backoff
- Enforces allowed enums
- `audience=minor` (default) keeps rails/schema; `audience=adult` bypasses them

Samples:

- `data/alignment/npc_alignment_sample.jsonl`
- `data/alignment/npc_alignment_dataset.jsonl`
Performance:

- Optional flash-attention (`use_flash_attn`)
- `torch.compile` (`compile_model`)
- Quantization (4/8-bit)
- Per-process LRU caching

API hardening:

- Prometheus metrics at `/metrics`
- Rate limiting + timeouts
- Stricter auth default (`REQUIRE_API_TOKEN=true`)
Adapters:

- Manifest at `data/adapters/manifest.json`
- Publish to Hub with `scripts/publish_adapter.py`
- Merge adapters with `scripts/merge_adapters.py`
Deployment container:

- Dockerfile uses `gunicorn` + healthcheck
- Configurable defaults via `DEFAULT_BASE_MODEL`, `DEFAULT_TOKENIZER_PATH`, `DEFAULT_ADAPTER_NAME`, `DEFAULT_MANIFEST_PATH` (see the run sketch after this list)
- Concurrency caps and rate limiting enabled
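For example, a container run overriding those defaults might look like this (the image tag `npc-api` and the port mapping are placeholders; the env var names come from the list above and from the API section):

```bash
docker build -t npc-api deployment/
docker run -p 5000:5000 \
  -e API_TOKEN=yourtoken \
  -e REQUIRE_API_TOKEN=true \
  -e DEFAULT_BASE_MODEL=EleutherAI/pythia-410m-deduped \
  -e DEFAULT_ADAPTER_NAME=npc_core_pythia_410m_v1 \
  npc-api
```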