# BIJEPAX-lite 3-seed candidate

Config matches the successful seed 42 run:
- script: our_submission/train_gpt_v15_bijepax.py
- DISABLE_COMPILE=1
- CASEOPS_ENABLED=1
- PPM_MIXER_ENABLED=1 order=5 H=0.999 L=0.18 T=0.80
- TTT_ENABLED=0
- LQER_TOP_K=1
- BIJEPAX_ENABLED=1 weight=0.01 start=0.35 end=0.80 fwd_hops=4 bwd_hops=4 cycle=0 head_dim=32 stride=64 lr=0.001

Existing seed 42:
- run: v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405
- final ppm_sliding val_bpb: 0.97234287
- artifact bytes: 15997180
- eval time: 502131ms
- rc: 0

Queued seeds: 314, 999


## v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209
- started: 2026-05-01T02:52:09Z
- log: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209/train.log
- finished: 2026-05-01T03:14:07Z
- rc: 0
- scores:
  - diagnostic pre-quantization post-ema val_loss:2.42899363 val_bpb:1.10988323 eval_time:9910ms
  - Total submission size quantized+pergroup: 15999539 bytes
  - diagnostic quantized val_loss:2.44155528 val_bpb:1.11562304 eval_time:9926ms
  - ppm_mixer val_bpb:0.97206308 eval_time:453715ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
  - ppm_sliding val_loss:2.45044876 val_bpb:0.97206308 eval_time:499038ms

## v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407
- started: 2026-05-01T03:14:07Z
- log: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407/train.log
- finished: 2026-05-01T03:36:01Z
- rc: 0
- scores:
  - diagnostic pre-quantization post-ema val_loss:2.43314506 val_bpb:1.11178015 eval_time:9911ms
  - Total submission size quantized+pergroup: 15997593 bytes
  - diagnostic quantized val_loss:2.44582432 val_bpb:1.11757370 eval_time:11393ms
  - ppm_mixer val_bpb:0.97373767 eval_time:451054ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
  - ppm_sliding val_loss:2.45502055 val_bpb:0.97373767 eval_time:496384ms

## Final scrape
Key lines from each run's `train.log`, grouped by run (grep path prefixes dropped):

### v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209
- artifact_dir: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209
- logfile: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209.txt
- model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209/final_model.pt
- quantized_model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209/final_model.int6.ptz
- run_id: v15_bijepaxlite_lqer1_nocompile_s314_20260501_025209
- Total submission size quantized+pergroup: 15999539 bytes
- diagnostic quantized val_loss:2.44155528 val_bpb:1.11562304 eval_time:9926ms
- ppm_mixer val_bpb:0.97206308 eval_time:453715ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
- ppm_sliding val_loss:2.45044876 val_bpb:0.97206308 eval_time:499038ms

### v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405
- artifact_dir: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405
- logfile: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405/v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405.txt
- model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405/final_model.pt
- quantized_model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405/final_model.int6.ptz
- run_id: v15_bijepaxlite_lqer1_nocompile_s42_20260501_022405
- Total submission size quantized+pergroup: 15997180 bytes
- diagnostic quantized val_loss:2.44116551 val_bpb:1.11544494 eval_time:10342ms
- ppm_mixer val_bpb:0.97234287 eval_time:456845ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
- ppm_sliding val_loss:2.45118426 val_bpb:0.97234287 eval_time:502131ms

### v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407
- artifact_dir: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407
- logfile: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407.txt
- model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407/final_model.pt
- quantized_model_path: /workspace/parameter-golf/our_submission/1000/runs/v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407/final_model.int6.ptz
- run_id: v15_bijepaxlite_lqer1_nocompile_s999_20260501_031407
- Total submission size quantized+pergroup: 15997593 bytes
- diagnostic quantized val_loss:2.44582432 val_bpb:1.11757370 eval_time:11393ms
- ppm_mixer val_bpb:0.97373767 eval_time:451054ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
- ppm_sliding val_loss:2.45502055 val_bpb:0.97373767 eval_time:496384ms
# Legality Audit - BIJEPAX-lite

## Verdict

Current read: **likely legal/submittable**, assuming the existing CaseOps byte-sidecar/PPM lane is accepted.

The BIJEPAX-lite addition itself is low-risk because it is training-only and has no evaluation-time access to future validation tokens.

## Challenge rules checked

From the repository README:

- Submission artifact size is code bytes plus compressed model bytes.
- The cap is strict decimal `16,000,000` bytes.
- Evaluation may not use training data unless its bytes are paid for inside the artifact.
- Validation data may not be used during training.
- Evaluation must complete within 10 minutes on 8xH100, separate from the 10-minute training cap.
- Test-time methods must score before updating on validation tokens.

## Artifact size

Seed 42:

- `Serialized model quantized+pergroup: 15955181 bytes`
- `Total submission size quantized+pergroup: 15997180 bytes`
- Strict cap: `16000000 bytes`
- Headroom: `2820 bytes`

This is tight but under the cap.

`LQER_TOP_K=1` was used specifically to create byte headroom: an earlier BIJEPA run without this trim packaged at `16,004,902` bytes and was not submittable.
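
As a quick check of the headroom arithmetic (constants copied from the numbers above):

```python
CAP = 16_000_000      # strict decimal artifact cap
TOTAL = 15_997_180    # seed 42: Total submission size quantized+pergroup
assert TOTAL <= CAP
print(CAP - TOTAL)    # 2820 bytes of headroom
```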

## Training-only JEPA auxiliary

Relevant implementation:

- `class MultiDirectionalBiJEPAX`
- `def bijepax_weight_at` (sketched after this list)
- `train_model(...): bijepax_module = MultiDirectionalBiJEPAX(...)`
- `step_fn(...): loss = ce_loss + bijepax_module(hidden, ...)`

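A plausible reconstruction of the schedule gate from the submitted config (`weight=0.01 start=0.35 end=0.80`); the real `bijepax_weight_at` may ramp rather than step, so treat this as a sketch:

```python
def bijepax_weight_at(frac_elapsed: float, weight: float = 0.01,
                      start: float = 0.35, end: float = 0.80) -> float:
    """Auxiliary-loss weight at a given fraction of the wallclock budget (sketch)."""
    return weight if start <= frac_elapsed < end else 0.0
```
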
The predictor module is created outside `base_model`:

```python
# Separate module and separate optimizer; never attached to base_model.
bijepax_module = MultiDirectionalBiJEPAX(...).to(device).bfloat16()
bijepax_opt = torch.optim.Adam(bijepax_module.parameters(), ...)
```

It is not assigned as a child module of `base_model`, so `base_model.state_dict()` does not contain BIJEPAX predictor weights.
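
A minimal demonstration of why the exclusion holds (stand-in modules, not the real classes):

```python
import torch.nn as nn

base_model = nn.Linear(8, 8)  # stand-in for the GPT
bijepax = nn.Linear(8, 8)     # stand-in predictor, created outside base_model

# Because bijepax is never assigned as an attribute of base_model,
# base_model.state_dict() carries only the base weights.
print(list(base_model.state_dict()))  # ['weight', 'bias'] -- no predictor keys
```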

Serialization only saves `base_model.state_dict()`:

```python
# Only the base model's weights are serialized; the BIJEPAX predictor
# heads live in the separate module above and never reach the artifact.
torch.save(base_model.state_dict(), h.model_path)
sd_cpu = _unbank_state_dict(base_model.state_dict(), h.num_layers)
```

So the JEPA predictor heads are not present in the final artifact.

## No validation leakage during training

Training batches come from `DocumentPackingLoader(h, device)`.

Validation data is loaded for periodic/terminal validation, but the BIJEPAX training loss only uses hidden states from training microbatches:

```python
x, y, cu_seqlens, _max_seqlen = train_loader.next_batch(...)  # training batch only
ce_loss, hidden = forward_with_hidden(x, y, ...)
loss = ce_loss + bijepax_module(hidden, ...)  # auxiliary loss sees training hidden states only
```

The BIJEPAX module does not read validation tokens, validation bytes, or validation sidecars.

## Evaluation path

Final score uses the existing PPM sliding evaluator:

- `eval_val_ppm_sliding`
- `ppm_mixer val_bpb`
- `ppm_sliding val_loss / val_bpb`

The PPM mixer follows a score-before-update discipline over the scored target stream: the neural log probabilities are computed first, then the byte mixer walks the bytes in order and updates its tables only after scoring each byte.
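
As a schematic illustration of that discipline (the `mixer` interface here is hypothetical, not the actual evaluator API):

```python
import math

def score_stream(byte_stream: bytes, mixer) -> float:
    """Score-before-update over a target byte stream (illustrative interface)."""
    total_bits = 0.0
    for b in byte_stream:
        p = mixer.prob(b)        # probability computed from past context only
        total_bits += -math.log2(max(p, 1e-12))
        mixer.update(b)          # tables advance only after the byte is scored
    return total_bits / len(byte_stream)  # bits per byte (bpb)
```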

Legality risk is therefore concentrated in whether reviewers accept this existing PPM/CaseOps scoring lane, not in BIJEPAX-lite.

## Cross-document leak check

The SmearGate cross-document leak fix is present in both hidden and TTT paths:

```python
# Zero the smear term at document boundaries so the first token of a
# packed document never mixes in the previous document's last token.
not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1)
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1)
```

TTT is disabled for this candidate (`TTT_ENABLED=0`), but the symmetric fix is still present.

## Eval compile

The run uses `DISABLE_COMPILE=1`. Post-serialize evaluation also honors this:

```python
if os.environ.get("DISABLE_COMPILE", "0") == "1":
    log("eval_compile:disabled_by_env")
    compiled_model = eval_model
    compiled_forward_logits = eval_model.forward_logits
```

This avoids the compile stall encountered in the first BIJEPAX attempts.

## Risks / reviewer-facing caveats

- The artifact headroom is only `2820` bytes on seed 42. Do not add substantial code unless compression is rechecked.
- The PR should avoid unverifiable claims such as "BiJEPA proved 4x better on chaotic systems" unless the exact source is provided.
- The submission should clearly say the JEPA module is an auxiliary training regularizer, not an eval-time bidirectional predictor.
- If the competition reviewers consider the PPM/CaseOps byte-sidecar lane non-compliant, this candidate inherits that risk.
# BIJEPAX-lite JEPA + SP8192 CaseOps PPM

This record submits a Claude-designed, JEPA-inspired training-only auxiliary regularizer on top of the SP8192 CaseOps + per-group compression + PPM sliding stack.

The final 3-seed mean is:

```text
ppm_sliding val_bpb: 0.97271454
```

## Results

| Seed | Final `ppm_sliding val_bpb` | Quantized diagnostic `val_bpb` | Artifact bytes | Train stop | Eval time | Exit code |
|---:|---:|---:|---:|---:|---:|---:|
| 42 | `0.97234287` | `1.11544494` | `15,997,180` | `2014` steps / `599.843s` | `502.131s` | `0` |
| 314 | `0.97206308` | `1.11562304` | `15,999,539` | `2012` steps / `599.586s` | `499.038s` | `0` |
| 999 | `0.97373767` | `1.11757370` | `15,997,593` | `2013` steps / `599.821s` | `496.384s` | `0` |

Three-seed sample std: `0.00089703`.
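
Both summary statistics can be reproduced from the per-seed values in the table:

```python
import statistics

bpb = [0.97234287, 0.97206308, 0.97373767]  # seeds 42, 314, 999
print(sum(bpb) / len(bpb))    # ~0.97271454 (3-seed mean)
print(statistics.stdev(bpb))  # ~0.00089703 (sample std, n-1 denominator)
```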

All three runs are under:

- strict decimal `16,000,000` byte artifact cap
- 600s training cap
- 600s evaluation cap

## What is new

BIJEPAX-lite adds a small custom JEPA-style hidden-state prediction objective during training:

- hop-4 forward hidden-state prediction
- hop-4 backward hidden-state prediction
- cosine embedding-space loss
- LayerNorm-stabilized predictor heads
- no cycle head in the submitted lightweight config
- active only from `35%` to `80%` of the wallclock schedule
- separate optimizer and separate module from the base GPT

The predictor heads are **not serialized**. Final scoring is performed by the quantized base model with the existing causal PPM sliding evaluator.
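
For orientation, here is a minimal sketch of the shape of such an objective. The class name and layer layout are illustrative assumptions, not the actual `MultiDirectionalBiJEPAX` implementation; only the hop count of 4, `head_dim=32`, the cosine loss, and the LayerNorm-stabilized heads come from the submitted config.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiJEPAXSketch(nn.Module):
    """Illustrative hop-k bidirectional hidden-state predictor (not the real class)."""

    def __init__(self, d_model: int, head_dim: int = 32, hops: int = 4):
        super().__init__()
        self.hops = hops

        def head() -> nn.Module:
            # LayerNorm-stabilized predictor head with a low-rank bottleneck.
            return nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, head_dim),
                nn.GELU(),
                nn.Linear(head_dim, d_model),
            )

        self.fwd_head = head()  # predicts the hidden state `hops` steps ahead
        self.bwd_head = head()  # predicts the hidden state `hops` steps behind

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model). Targets are detached (stop-gradient),
        # JEPA-style, so gradients reach the base model only via the context side.
        k = self.hops
        tgt = hidden.detach()
        loss_f = 1 - F.cosine_similarity(self.fwd_head(hidden[:, :-k]), tgt[:, k:], dim=-1).mean()
        loss_b = 1 - F.cosine_similarity(self.bwd_head(hidden[:, k:]), tgt[:, :-k], dim=-1).mean()
        return 0.5 * (loss_f + loss_b)
```

In the submitted run this auxiliary loss is scaled by `weight=0.01`, gated to the 35%-80% wallclock window, added to the CE loss, and optimized by its own Adam instance, separate from the base GPT.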

## Compliance notes

- `TTT_ENABLED=0`
- `LQER_TOP_K=1` keeps all seeds below the strict byte cap
- SmearGate BOS masking is present for packed-document cross-boundary safety
- BIJEPAX-lite trains only on training batches from `DocumentPackingLoader`
- BIJEPAX-lite does not access validation tokens or validation byte sidecars during training
- Final score is from `ppm_sliding`

The folder includes:

- `train_gpt.py`
- three seed logs
- full source/log captures for each seed
- `submission.json`
- `LEGALITY_AUDIT.md`
- `STATIC_AUDIT_NOTES.md`
- `REFERENCES.md`
- `JEPA.mp4` as a short visual/demo asset

## Acknowledgements

Thanks to Claude for designing the custom BIJEPAX-lite auxiliary objective and helping turn the JEPA idea into a runnable candidate. Thanks to Codex for implementing the run path, auditing legality, coordinating the 3-seed package, and assembling this PR. Thanks also to the Parameter Golf community for the public ideas and fast iteration that this stack builds on.

## Validation

- `python3 -m py_compile records/track_10min_16mb/2026-05-01_BIJEPAXLite_JEPA_PPM_0.97271/train_gpt.py`
- `python3 -m json.tool records/track_10min_16mb/2026-05-01_BIJEPAXLite_JEPA_PPM_0.97271/submission.json`
- 3 full remote runs on 8xH100 completed with `rc=0`