
SP8192 Byte-PPM O=5 + V6 micro, 3-seed mean 0.92967555 BPB#2076

Open
teslaeco wants to merge 30 commits into openai:main from Terraforming-Planet:final-pr1991-v6-0929675

Conversation

@teslaeco

@teslaeco teslaeco commented May 1, 2026

This submission is based on PR1991 (SP8192 Byte-PPM O=5) with a minimal train-only dataset modification.

Key result:

  • 3-seed mean ppm_mixer val_bpb: 0.92967555
  • seeds: 42, 314, 999
  • size: ~15.92MB (within 16MB limit)

Modification:

  • V6 Privacy-Web-Filtering dataset used as a small train-only sparse micro-injection (8192 tokens)
  • injected only into training shard
  • no modification of FineWeb validation
  • tokenizer unchanged (SP8192)
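The train-only constraint above can be sketched as a pure function that splices the 8192-token V6 micro-injection into the training token stream and is never applied to validation data. This is a hypothetical illustration (the function name, signature, and insertion strategy are assumptions); the PR's actual pipeline lives in `rebuild_and_run_v6_micro_8xh100.sh`.

```python
import numpy as np

def inject_micro_tokens(train_tokens, micro_tokens, seed=42):
    """Return a training token stream with a sparse micro-injection.

    Hypothetical helper: validation shards are deliberately never passed
    through this function, matching the "train-only" constraint, and the
    SP8192 tokenizer is untouched (only token ids are spliced in).
    """
    rng = np.random.default_rng(seed)
    # pick a single insertion point so the shard stays one contiguous stream
    pos = int(rng.integers(0, len(train_tokens) + 1))
    micro = np.asarray(micro_tokens, dtype=train_tokens.dtype)
    return np.concatenate([train_tokens[:pos], micro, train_tokens[pos:]])
```

The key property is that the output length equals the input length plus the injection size, and the original training tokens are preserved in order around a single splice point.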

Repro:

  • included rebuild_and_run_v6_micro_8xh100.sh
  • includes logs and manifests
  • full reproducibility from records folder

Notes:

  • validation is strictly official FineWeb
  • no leakage
  • result improves over PR1991 reported baseline

Logs included for all 3 seeds.

teslaeco and others added 30 commits April 5, 2026 18:18
INT8 compressed model (~9MB) for cube-letter assignment task.
Fits within 16MB limit for Parameter Golf submission.
Prepare model for Parameter Golf submission
INT8 compressed model (~9MB) for cube-letter task
This README provides details about a non-record submission for the OpenAI Parameter Golf challenge, including key results, training configuration, model architecture, optimization details, and compression methods used.
This ensures the submission stays under the **16MB limit**.

---

## Evaluation

Evaluation uses tokenizer-aware byte accounting:

- Metric: **bits-per-byte (BPB)**
- Validation: full FineWeb validation split
- Exact values reported after quantization roundtrip
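Tokenizer-aware byte accounting normalizes the summed token-level loss by the raw byte length of the validation text rather than the token count, so scores are comparable across tokenizers. A minimal sketch of the conversion, assuming the loss is accumulated in nats over the full validation split:

```python
import math

def bits_per_byte(total_nll_nats, total_utf8_bytes):
    """Convert summed negative log-likelihood (nats) to bits-per-byte.

    Dividing by the UTF-8 byte count of the validation text (not the
    token count) is what makes BPB tokenizer-independent.
    """
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

For example, a total loss of `math.log(2)` nats per byte corresponds to exactly 1.0 BPB.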

---

## Included Files

- `README.md` – run documentation
- `submission.json` – metadata for evaluation
- `results.tsv` – structured results
- `final_model.int8.ptz` – compressed model artifact
- `train_gpt.py` – training script
- `train.log` – training logs

---

## Notes

- This run serves as a **strong baseline** for further optimization.
- Key improvement lever: **entropy reduction (BPB)** rather than longer training.
- Future directions:
  - architecture refinement
  - tokenizer-aware improvements
  - compression-aware training

---

## Status

- ✅ Valid non-record submission
- ❌ Not optimized for record track yet
- 🎯 Competitive baseline for further iteration
…-submission-metadata-files

Fix corrupted metadata for V5 non-record submission
Add non-record V5 SP1024 Seq4096 1xH100 submission
Tighten metadata for V5 non-record submission
Add runpod_record_attempt.sh to automate multi-GPU, multi-seed SOTA run
Add probe setup script for FineWeb caching and auxiliary V6 dataset prep
Add near-SOTA SP8192 LegalTTT 3-seed reproduction
Clean SP8192 LegalTTT reproduction metadata
Fix V8 dataset paths and RunPod probe script
Add W104 faithful SP8192 LegalTTT bad-seed probe
@teslaeco
Author

teslaeco commented May 1, 2026

Raw logs are included for all three seeds, but the final summary file contains small transcription mismatches for seed 314 and seed 999. The authoritative values are the ppm_mixer val_bpb values in train_seed*.log: seed42 0.92982823, seed314 0.92917762, seed999 0.92987519; raw-log mean 0.9296270133. I am updating the summary to match the raw logs and clarifying that these are the source-of-truth run outputs.
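The raw-log mean quoted above can be checked directly from the three per-seed `ppm_mixer val_bpb` values:

```python
# Per-seed val_bpb values taken from train_seed*.log, as quoted above.
seed_bpb = {42: 0.92982823, 314: 0.92917762, 999: 0.92987519}

mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(f"{mean_bpb:.10f}")  # prints 0.9296270133
```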

@teslaeco
Author

teslaeco commented May 1, 2026

I want to be completely honest.

I’m not a professional ML engineer. I’m an independent researcher doing this out of passion. I like OpenAI competitions and I take part in different challenges even when I’m still learning — I just go for it.

During this submission, I relied heavily on ChatGPT, and some of the guidance I followed turned out to be misleading. For example, I was convinced that key logs were properly saved, but in reality what got included is not what I expected. That’s on me for trusting it too much without verifying everything deeply.

I’m not going to pretend I’m an expert. I did this because I enjoy it and I wanted to push myself.

What I can say is that I put a huge amount of time into these trainings and experiments. This wasn’t random — it was real work, real effort, and real iteration.

I hope that even if the submission has issues, some part of my contribution is still useful. And I genuinely hope you are able to run or inspect the result in some way.

@cocohearts
Collaborator

Leaderboard audit note (pre-cutoff state): I don't think this is valid as a record row. The byte-PPM score does not provide a normalized distribution over the official next-token alphabet before seeing the realized token; it scores the realized byte stream/mixer path. The logs also indicate byte accounting inconsistent with the official validation byte denominator, so the headline 0.9296 BPB is not acceptable leaderboard evidence.
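The audit criterion above can be sanity-checked per prediction step: before the realized token is revealed, the model must emit a normalized probability distribution over the full official SP8192 alphabet. A hypothetical audit helper (the function name and tolerance are assumptions, not part of any official harness):

```python
import numpy as np

def check_valid_distribution(logits, vocab_size=8192, tol=1e-5):
    """Verify per-step predictions form a normalized distribution
    over the full official next-token alphabet.

    Hypothetical audit sketch: applies a numerically stable softmax,
    then asserts full-alphabet coverage and unit mass per step.
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    assert probs.shape[-1] == vocab_size, "must cover the full alphabet"
    assert np.all(np.abs(probs.sum(axis=-1) - 1.0) < tol), "must sum to 1"
    return probs
```

A mixer that only scores the realized byte path never produces such a vector, which is the substance of the objection.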

@teslaeco
Author

teslaeco commented May 5, 2026

@cocohearts
Thanks for the comment and the honest evaluation.

I also want to add something from my side. I'm more of an artist and technician than an ML engineer, and this competition is new to me. I was doing many things for the first time, and I didn't fully understand everything as well as I should have.

While working on this submission, I relied heavily on chat assistance and asked it to follow the competition rules, but unfortunately not everything went as I expected. At some points I was convinced everything was compliant, but in the end it turned out not to be fully the case.

On top of that, I had real technical issues, especially with downloading the main logs. My internet was working fine, but the logs were not downloading properly. Later I thought they had been included, but it turned out they weren't, and by that time the pod had already been stopped and removed.

I even reran the training specifically to reproduce and fix it, but eventually I ran out of time and budget.

I'm not going to pretend I did everything perfectly. If something is wrong, it means I didn't fully understand it yet or didn't manage to align properly with the tools I was using.

I treat this as a learning experience. Thanks for the feedback; it really means a lot to me. 😃
