
[WIP] Compression-aware fixed-step research #33

Closed
JusticeShultz wants to merge 10 commits into openai:main from JusticeShultz:main

Conversation


@JusticeShultz commented Mar 18, 2026

Summary

This is a WIP local research branch for Parameter Golf focused on improving post-roundtrip val_bpb under the actual artifact constraint, using 1x RTX 3090 for local search before moving to the target 8xH100 environment.

The biggest shift in this branch is methodological: local ranking moved away from noisy short wallclock runs and onto a fixed-step exact roundtrip track. That made the local loop much more trustworthy and changed the research conclusions substantially.

The current branch focus is now:

  • compression-aware training
  • export-aware regularization
  • dense near-cap scaling
  • fixed-step post-roundtrip evaluation

As of March 19, 2026, train_gpt.py remains under the repo hard cap at 1492 lines.

What This Branch Adds

Compression-aware / export-aware training knobs

  • COMPRESSION_REG_WEIGHT
  • COMPRESSION_GRID_REG_WEIGHT
  • COMPRESSION_SCALE_REG_WEIGHT
  • COMPRESSION_RANK1_REG_WEIGHT
  • TERNARY_REG_WEIGHT
  • OUTLIER_REG_WEIGHT
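As a rough illustration of how these knobs enter training, here is a minimal sketch; the penalty form and hook point are assumptions, not the branch's actual implementation:

```python
import torch

# Hypothetical sketch of a compression-aware penalty in the spirit of
# COMPRESSION_REG_WEIGHT: an L1 pull on the weight matrices, which tends
# to shrink the int8-quantized, zlib-compressed artifact. The real
# penalty in train_gpt.py may differ.
COMPRESSION_REG_WEIGHT = 0.005

def compression_penalty(model: torch.nn.Module) -> torch.Tensor:
    return sum(p.abs().mean() for p in model.parameters() if p.ndim >= 2)

# inside the training step:
# loss = ce_loss + COMPRESSION_REG_WEIGHT * compression_penalty(model)
```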

Architecture / export knobs

  • NUM_UNIQUE_BLOCKS
  • WINDOW_SIZE
  • EMBED_DIM
  • INT8_AXIS_MODE
  • INT8_RESIDUAL_RANK
  • INT8_RESIDUAL_BUDGET_BYTES
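For the INT8_* knobs, one plausible export-side shape is per-axis scales plus a low-rank correction of the quantization error; all names and the packing below are assumptions, not the branch's code:

```python
import numpy as np

# Hypothetical sketch of int8 export with a low-rank residual, along the
# lines of INT8_AXIS_MODE / INT8_RESIDUAL_RANK: quantize per row, then
# store a rank-r SVD correction of the quantization error in fp16.
def export_int8_with_residual(w: np.ndarray, rank: int):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per-row scales
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    err = w - q.astype(np.float32) * scale                        # quantization error
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_r = (u[:, :rank] * s[:rank]).astype(np.float16)             # rank-r factor
    return q, scale.astype(np.float16), u_r, vt[:rank].astype(np.float16)
```

INT8_RESIDUAL_BUDGET_BYTES would then cap the total bytes spent on these residual factors across tensors.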

Hybrid eval-time knobs

  • EVAL_CACHE_MIX_WEIGHT
  • EVAL_BIGRAM_MIX_WEIGHT
  • EVAL_CACHE_SIZE
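These suggest an eval-time mixture of the model's distribution with cheap sidecar predictors; a minimal sketch of that idea follows (weights, names, and helper inputs are assumptions):

```python
import torch

# Hypothetical eval-time blend in the spirit of EVAL_CACHE_MIX_WEIGHT /
# EVAL_BIGRAM_MIX_WEIGHT: mix model probabilities with a recent-token
# cache distribution and a bigram model, at evaluation time only.
def mixed_log_probs(model_logits, cache_probs, bigram_probs,
                    w_cache: float = 0.05, w_bigram: float = 0.05):
    p_model = torch.softmax(model_logits, dim=-1)
    p = (1.0 - w_cache - w_bigram) * p_model + w_cache * cache_probs \
        + w_bigram * bigram_probs
    return p.clamp_min(1e-9).log()
```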

Local evaluation / search controls

  • VAL_MAX_TOKENS
  • ROUNDTRIP_VAL_MAX_TOKENS
  • FINAL_ROUNDTRIP_EVAL
  • fixed-step local runs via ITERATIONS
  • dedicated local sweep launchers for:
    • fixed-step roundtrip comparisons
    • export-aware sweeps
    • iso-byte dense sweeps
    • high-cap dense width/depth sweeps
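A sweep launcher in this setup might look roughly like the following; env-var names mirror the knobs above, while the runner invocation and values are placeholders:

```python
import os
import subprocess

# Hypothetical fixed-step sweep launcher: every candidate trains for the
# same ITERATIONS and is ranked by the exact final roundtrip val_bpb,
# instead of by noisy wallclock-matched runs.
for grid_w in (0.08, 0.10, 0.12):
    env = dict(os.environ,
               ITERATIONS="5000",          # fixed step count, not wallclock
               FINAL_ROUNDTRIP_EVAL="1",
               ROUNDTRIP_VAL_MAX_TOKENS="1000000",
               COMPRESSION_REG_WEIGHT="0.005",
               COMPRESSION_GRID_REG_WEIGHT=str(grid_w))
    subprocess.run(["python", "train_gpt.py"], env=env, check=True)
```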

What Has Been Tried

1. Matched local roundtrip baseline

  • Run: baselinert3090_20260318_181344
  • Exact final roundtrip: val_bpb=2.11089617
  • Artifact: 6,705,058 bytes

2. Compression-aware baseline

  • Run: compressrt3090_20260318_175828
  • Knobs: COMPRESSION_REG_WEIGHT=0.005
  • Exact final roundtrip: val_bpb=2.06085837
  • Artifact: 6,839,798 bytes

This was the first clear local win over the matched baseline.

3. Fixed-step methodology pivot

The local wallclock track turned out to be too noisy to trust for small deltas, so ranking moved to a fixed-step exact roundtrip track.
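The quantity being ranked is the exact post-roundtrip value: serialize the quantized weights, compress, decompress, and re-evaluate. A minimal sketch of the byte measurement, assuming a simple packing format:

```python
import io
import zlib
import numpy as np

# Hypothetical "exact roundtrip" byte count: pack the int8 tensors in a
# deterministic order and zlib-compress; val_bpb is then re-measured on
# the weights restored from this artifact, not on the pre-quant model.
def artifact_bytes(tensors: dict) -> int:
    buf = io.BytesIO()
    for name in sorted(tensors):
        buf.write(np.ascontiguousarray(tensors[name], dtype=np.int8).tobytes())
    return len(zlib.compress(buf.getvalue(), level=9))
```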

Dense fixed-step control:

  • Run: fixedsteprtsweep_20260318_221632_base_a
  • Exact final roundtrip: val_bpb=2.04299145

This became the new local control.

4. Export-aware compression regularization

Best export-aware probe:

  • Run: exportaware_fixedstep_20260318_223456_g010_r000
  • Knobs:
    • COMPRESSION_REG_WEIGHT=0.005
    • COMPRESSION_GRID_REG_WEIGHT=0.10
  • Exact final roundtrip: val_bpb=2.04288777
  • Artifact: 6,663,470 bytes

This slightly but repeatably improved the fixed-step control.
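One plausible form of such a grid penalty is sketched below, assuming per-row int8 scales; this is not necessarily the branch's exact loss:

```python
import torch

# Hypothetical export-aware grid penalty in the spirit of
# COMPRESSION_GRID_REG_WEIGHT: pull each weight toward the nearest point
# of the int8 grid it will be snapped to at export. round() has zero
# gradient, so the term acts as a pure attraction toward the grid.
def grid_penalty(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0 + 1e-12
    q = (w / scale).round().clamp(-127, 127)
    return ((w - q * scale) ** 2).mean()

# loss = ce_loss + 0.10 * sum(grid_penalty(p) for p in weight_matrices)
```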

Follow-up export-aware checks:

  • scale-aware regularization: regressed
  • nearby grid weights (0.08, 0.12): regressed
  • tiny outlier suppression: regressed

5. Branches that do not currently win locally

These were tested and are currently parked:

  • hybrid eval-time sidecar
  • sparse attention
  • recurrent/shared-block variants
  • ternary/low-bit shaping
  • residual-budget tuning
  • factorized/reused-block variants in the tested regime

Important nuance: these negative results were gathered before or outside the stronger near-cap dense regime, so they are not being treated as globally dead ideas.

6. Iso-byte dense frontier sweep

This changed the branch direction significantly.

Results:

  • b10 -> 2.02814871 at 9,683,932 bytes
  • b12 -> 2.05262920 at 11,334,608 bytes
  • b14 -> 2.03768242 at 13,094,288 bytes
  • b155 -> 2.00290272 at 13,741,308 bytes

This showed that simply spending more of the byte budget on a dense compression-aware model mattered much more than most under-cap micro-ideas.

7. High-cap dense width/depth frontier

Recovered / rerun near-cap results:

  • w608_l12 -> 2.00551677 at 14,371,393 bytes
  • w624_l12 -> 2.01128088 at 15,024,114 bytes
  • d576_l14 -> 1.99806297 at 15,222,128 bytes
  • w640_l12 -> 2.00505534 at 15,658,993 bytes

Current Best Local Result

Current local leader:

  • Run: highcapdense_rerun_20260319_d576_l14
  • Shape: 14 layers / 576 dim / 8 heads / 4 KV heads
  • Knobs:
    • COMPRESSION_REG_WEIGHT=0.005
    • COMPRESSION_GRID_REG_WEIGHT=0.10
  • Exact final roundtrip: val_bpb=1.99806297
  • Total artifact: 15,222,128 bytes

Current Findings

  • Optimizing for the roundtripped artifact matters more than pre-quant proxy quality.
  • The fixed-step exact roundtrip local track is much more trustworthy than short local wallclock ranking.
  • Compression-aware training is the only clearly first-order win so far.
  • Export-aware grid alignment is promising, but currently a small gain, not a complete solution.
  • Dense scaling near the byte cap dominates most earlier under-cap micro-ideas.
  • In the near-cap local regime tested so far, depth currently looks better than width.
  • Many earlier negative results from recurrence / sparsity / ternary likely reflect testing in the wrong byte regime, not globally bad ideas.

Current / Next Direction

The branch has now shifted from “small-model micro-tuning” to “near-cap dense frontier + export-side improvements.”

The next likely high-upside directions are:

  • export-side symmetry-aware permutation for compression (see the sketch after this list)
  • tensor sensitivity mapping / heterogeneous export allocation
  • building new export ideas on top of the deeper dense 14x576 control instead of the older 6.6 MB model
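For the permutation idea flagged above, a toy sketch of the mechanism; the similarity ordering is a placeholder for illustration, not a worked-out method:

```python
import numpy as np

# Hypothetical export-side permutation: reorder the rows of an int8
# weight matrix (storing the inverse permutation for reload) so that
# similar rows are adjacent, which can give zlib longer matches.
def permute_rows_for_zlib(q: np.ndarray):
    order = np.argsort(q.sum(axis=1, dtype=np.int64))  # crude similarity proxy
    inv = np.argsort(order)                            # needed to undo at load
    return q[order], inv
```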

Caveats

  • These are local experiments on 1x RTX 3090 under Windows.
  • They are not leaderboard claims and are not directly comparable to the official 8xH100 / 10 minute challenge runs.
  • Local evaluation is still a proxy, even though it is now much more reliable than the earlier wallclock track.
  • No tokenizer work has been explored yet.
  • No dataset/accounting tricks or external-compute loopholes are being used.

Why This Branch Exists

The goal of this branch is to build a reproducible local search loop that ranks ideas against a closer approximation of the real challenge objective, so the strongest branch can be taken to target compute once grant access is available.

@JusticeShultz changed the title from “[WIP] Compression-aware roundtrip-proxy research for Parameter Golf” to “[WIP] Compression-aware roundtrip-proxy research” on Mar 18, 2026
… new roundtrip sweep launchers

This update brings the local 3090 research branch up to date with the latest roundtrip-proxy findings.

Changes:
- updated `docs/research_tracks.md` with the completed recurrent/shared-block and sidecar sweep results
- marked the dense compression-aware baseline (`COMPRESSION_REG_WEIGHT=0.005`) as the current local leader at `final_int8_zlib_roundtrip_exact val_bpb 2.06085837`
- reprioritized the next pivot toward conservative low-bit / ternary shaping on top of the winning dense setup
- fixed Run Monitor stale wrap-up handling so incomplete logs no longer show bogus "over expected quantized validation" ETAs
- added dedicated sweep launchers for:
  - roundtrip sidecar tuning
  - roundtrip ternary / low-bit tuning

Current status:
- best local matched roundtrip result remains `2.06085837`
- sidecar revisit got very close (`2.06132482`) but did not beat the dense winner
- recurrent/shared-block variants were not competitive on the local roundtrip track
- ternary sweep is now the active next research pivot
…p dense frontier exploration

This update brings the local 3090 research branch up to date with the latest fixed-step roundtrip results and dense near-cap experiments.

Highlights:
- stabilized local methodology around fixed-step post-roundtrip evaluation instead of noisy wallclock-only ranking
- confirmed compression-aware training as the only clearly first-order win among the early experimental branches
- added export-aware compression regularization work, including grid-alignment and follow-up scale/outlier checks
- showed that most small-model micro-ideas (sidecar, ternary, sparse attention, recurrence, residual-budget tuning) do not currently beat the dense compression-aware control on the trusted local track
- ran an iso-byte dense sweep, which showed that simply spending more of the byte budget matters much more than small under-cap regularizer gains
- extended the dense frontier near the artifact cap and found a new local leader:
  - `14 layers / 576 dim / 8 heads / 4 KV heads`
  - `COMPRESSION_REG_WEIGHT=0.005`
  - `COMPRESSION_GRID_REG_WEIGHT=0.10`
  - fixed-step exact roundtrip `val_bpb=1.99806297`
  - total artifact `15,222,128` bytes

Current local takeaways:
- dense scaling near the byte cap is the dominant direction right now
- depth currently looks better than width in the near-cap regime tested so far
- the next likely high-upside branch is export-side work on top of the deeper dense control, not more small-model sidecar or low-bit sweeps

Also included:
- updated `docs/research_tracks.md`
- added/updated local sweep scripts for fixed-step export-aware, iso-byte, and high-cap dense experiments
- hardened parts of the local sweep process after finding launcher/harness issues during larger runs
@JusticeShultz changed the title from “[WIP] Compression-aware roundtrip-proxy research” to “[WIP] Fixed-step compression-aware research and dense near-cap frontier” on Mar 19, 2026
@JusticeShultz changed the title from “[WIP] Fixed-step compression-aware research and dense near-cap frontier” to “[WIP] Compression-aware fixed-step research” on Mar 19, 2026
@0hq closed this Mar 19, 2026
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
…penai#33)

The previous fix (PR openai#32) extracted the JSON correctly but then piped
the raw verdict string ('NOT YET') into $GITHUB_OUTPUT without a key,
which the runner rejects:

    Unable to process file command 'output' successfully.
    Invalid format 'NOT YET'

Fix: write 'verdict=<value>' instead. Also replace the space inside
the verdict ('GATE-2 PASS', 'NOT YET') with an underscore so the value
is a single token, since GITHUB_OUTPUT doesn't accept multi-word
unencoded values without the multiline EOF marker.

This output is informational only — the digest step reads from the
JSON file directly via jq, so the encoding change has no downstream
effect.

Refs openai#16.

Co-authored-by: Perplexity Computer <computer@perplexity.ai>