
[WIP] Compression-aware fixed-step research #33

Closed
JusticeShultz wants to merge 10 commits into openai:main from JusticeShultz:main

Conversation


@JusticeShultz commented Mar 18, 2026

Summary

This is a WIP local research branch for Parameter Golf focused on improving post-roundtrip val_bpb under the actual artifact constraint, using 1x RTX 3090 for local search before moving to the target 8xH100 environment.

The biggest shift in this branch is methodological: local ranking moved away from noisy short wallclock runs and onto a fixed-step exact roundtrip track. That made the local loop much more trustworthy and changed the research conclusions substantially.

The current branch focus is now:

  • compression-aware training
  • export-aware regularization
  • dense near-cap scaling
  • fixed-step post-roundtrip evaluation

As of March 19, 2026, train_gpt.py remains under the repo hard cap at 1492 lines.

What This Branch Adds

Compression-aware / export-aware training knobs

  • COMPRESSION_REG_WEIGHT
  • COMPRESSION_GRID_REG_WEIGHT
  • COMPRESSION_SCALE_REG_WEIGHT
  • COMPRESSION_RANK1_REG_WEIGHT
  • TERNARY_REG_WEIGHT
  • OUTLIER_REG_WEIGHT
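As a rough illustration of how these knobs enter training, here is a minimal sketch; the penalty form and hook point are assumptions, not the branch's actual implementation:

```python
import torch

# Hypothetical sketch of a compression-aware penalty in the spirit of
# COMPRESSION_REG_WEIGHT: an L1 pull on the weight matrices, which tends
# to shrink the int8-quantized, zlib-compressed artifact. The real
# penalty in train_gpt.py may differ.
COMPRESSION_REG_WEIGHT = 0.005

def compression_penalty(model: torch.nn.Module) -> torch.Tensor:
    return sum(p.abs().mean() for p in model.parameters() if p.ndim >= 2)

# inside the training step:
# loss = ce_loss + COMPRESSION_REG_WEIGHT * compression_penalty(model)
```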

Architecture / export knobs

  • NUM_UNIQUE_BLOCKS
  • WINDOW_SIZE
  • EMBED_DIM
  • INT8_AXIS_MODE
  • INT8_RESIDUAL_RANK
  • INT8_RESIDUAL_BUDGET_BYTES
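For the INT8_* knobs, one plausible export-side shape is per-axis scales plus a low-rank correction of the quantization error; all names and the packing below are assumptions, not the branch's code:

```python
import numpy as np

# Hypothetical sketch of int8 export with a low-rank residual, along the
# lines of INT8_AXIS_MODE / INT8_RESIDUAL_RANK: quantize per row, then
# store a rank-r SVD correction of the quantization error in fp16.
def export_int8_with_residual(w: np.ndarray, rank: int):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per-row scales
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    err = w - q.astype(np.float32) * scale                        # quantization error
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_r = (u[:, :rank] * s[:rank]).astype(np.float16)             # rank-r factor
    return q, scale.astype(np.float16), u_r, vt[:rank].astype(np.float16)
```

INT8_RESIDUAL_BUDGET_BYTES would then cap the total bytes spent on these residual factors across tensors.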

Hybrid eval-time knobs

  • EVAL_CACHE_MIX_WEIGHT
  • EVAL_BIGRAM_MIX_WEIGHT
  • EVAL_CACHE_SIZE
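These suggest an eval-time mixture of the model's distribution with cheap sidecar predictors; a minimal sketch of that idea follows (weights, names, and helper inputs are assumptions):

```python
import torch

# Hypothetical eval-time blend in the spirit of EVAL_CACHE_MIX_WEIGHT /
# EVAL_BIGRAM_MIX_WEIGHT: mix model probabilities with a recent-token
# cache distribution and a bigram model, at evaluation time only.
def mixed_log_probs(model_logits, cache_probs, bigram_probs,
                    w_cache: float = 0.05, w_bigram: float = 0.05):
    p_model = torch.softmax(model_logits, dim=-1)
    p = (1.0 - w_cache - w_bigram) * p_model + w_cache * cache_probs \
        + w_bigram * bigram_probs
    return p.clamp_min(1e-9).log()
```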

Local evaluation / search controls

  • VAL_MAX_TOKENS
  • ROUNDTRIP_VAL_MAX_TOKENS
  • FINAL_ROUNDTRIP_EVAL
  • fixed-step local runs via ITERATIONS
  • dedicated local sweep launchers for:
    • fixed-step roundtrip comparisons
    • export-aware sweeps
    • iso-byte dense sweeps
    • high-cap dense width/depth sweeps
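A sweep launcher in this setup might look roughly like the following; env-var names mirror the knobs above, while the runner invocation and values are placeholders:

```python
import os
import subprocess

# Hypothetical fixed-step sweep launcher: every candidate trains for the
# same ITERATIONS and is ranked by the exact final roundtrip val_bpb,
# instead of by noisy wallclock-matched runs.
for grid_w in (0.08, 0.10, 0.12):
    env = dict(os.environ,
               ITERATIONS="5000",          # fixed step count, not wallclock
               FINAL_ROUNDTRIP_EVAL="1",
               ROUNDTRIP_VAL_MAX_TOKENS="1000000",
               COMPRESSION_REG_WEIGHT="0.005",
               COMPRESSION_GRID_REG_WEIGHT=str(grid_w))
    subprocess.run(["python", "train_gpt.py"], env=env, check=True)
```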

What Has Been Tried

1. Matched local roundtrip baseline

  • Run: baselinert3090_20260318_181344
  • Exact final roundtrip: val_bpb=2.11089617
  • Artifact: 6,705,058 bytes

2. Compression-aware baseline

  • Run: compressrt3090_20260318_175828
  • Knobs: COMPRESSION_REG_WEIGHT=0.005
  • Exact final roundtrip: val_bpb=2.06085837
  • Artifact: 6,839,798 bytes

This was the first clear local win over the matched baseline.

3. Fixed-step methodology pivot

The local wallclock track turned out to be too noisy to trust for small deltas, so ranking moved to a fixed-step exact roundtrip track.
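The quantity being ranked is the exact post-roundtrip value: serialize the quantized weights, compress, decompress, and re-evaluate. A minimal sketch of the byte measurement, assuming a simple packing format:

```python
import io
import zlib
import numpy as np

# Hypothetical "exact roundtrip" byte count: pack the int8 tensors in a
# deterministic order and zlib-compress; val_bpb is then re-measured on
# the weights restored from this artifact, not on the pre-quant model.
def artifact_bytes(tensors: dict) -> int:
    buf = io.BytesIO()
    for name in sorted(tensors):
        buf.write(np.ascontiguousarray(tensors[name], dtype=np.int8).tobytes())
    return len(zlib.compress(buf.getvalue(), level=9))
```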

Dense fixed-step control:

  • Run: fixedsteprtsweep_20260318_221632_base_a
  • Exact final roundtrip: val_bpb=2.04299145

This became the new local control.

4. Export-aware compression regularization

Best export-aware probe:

  • Run: exportaware_fixedstep_20260318_223456_g010_r000
  • Knobs:
    • COMPRESSION_REG_WEIGHT=0.005
    • COMPRESSION_GRID_REG_WEIGHT=0.10
  • Exact final roundtrip: val_bpb=2.04288777
  • Artifact: 6,663,470 bytes

This slightly but repeatably improved the fixed-step control.
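One plausible form of such a grid penalty is sketched below, assuming per-row int8 scales; this is not necessarily the branch's exact loss:

```python
import torch

# Hypothetical export-aware grid penalty in the spirit of
# COMPRESSION_GRID_REG_WEIGHT: pull each weight toward the nearest point
# of the int8 grid it will be snapped to at export. round() has zero
# gradient, so the term acts as a pure attraction toward the grid.
def grid_penalty(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0 + 1e-12
    q = (w / scale).round().clamp(-127, 127)
    return ((w - q * scale) ** 2).mean()

# loss = ce_loss + 0.10 * sum(grid_penalty(p) for p in weight_matrices)
```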

Follow-up export-aware checks:

  • scale-aware regularization: regressed
  • nearby grid weights (0.08, 0.12): regressed
  • tiny outlier suppression: regressed

5. Branches that do not currently win locally

These were tested and are currently parked:

  • hybrid eval-time sidecar
  • sparse attention
  • recurrent/shared-block variants
  • ternary/low-bit shaping
  • residual-budget tuning
  • factorized/reused-block variants in the tested regime

Important nuance: these negative results were gathered before or outside the stronger near-cap dense regime, so they are not being treated as globally dead ideas.

6. Iso-byte dense frontier sweep

This changed the branch direction significantly.

Results:

  • b10 -> 2.02814871 at 9,683,932 bytes
  • b12 -> 2.05262920 at 11,334,608 bytes
  • b14 -> 2.03768242 at 13,094,288 bytes
  • b155 -> 2.00290272 at 13,741,308 bytes

This showed that simply spending more of the byte budget on a dense compression-aware model mattered much more than most under-cap micro-ideas.

7. High-cap dense width/depth frontier

Recovered / rerun near-cap results:

  • w608_l12 -> 2.00551677 at 14,371,393 bytes
  • w624_l12 -> 2.01128088 at 15,024,114 bytes
  • d576_l14 -> 1.99806297 at 15,222,128 bytes
  • w640_l12 -> 2.00505534 at 15,658,993 bytes

Current Best Local Result

Current local leader:

  • Run: highcapdense_rerun_20260319_d576_l14
  • Shape: 14 layers / 576 dim / 8 heads / 4 KV heads
  • Knobs:
    • COMPRESSION_REG_WEIGHT=0.005
    • COMPRESSION_GRID_REG_WEIGHT=0.10
  • Exact final roundtrip: val_bpb=1.99806297
  • Total artifact: 15,222,128 bytes

Current Findings

  • Optimizing for the roundtripped artifact matters more than pre-quant proxy quality.
  • The fixed-step exact roundtrip local track is much more trustworthy than short local wallclock ranking.
  • Compression-aware training is the only clearly first-order win so far.
  • Export-aware grid alignment is promising, but currently a small gain, not a complete solution.
  • Dense scaling near the byte cap dominates most earlier under-cap micro-ideas.
  • In the near-cap local regime tested so far, depth currently looks better than width.
  • Many earlier negative results from recurrence / sparsity / ternary likely reflect testing in the wrong byte regime, not globally bad ideas.

Current / Next Direction

The branch has now shifted from “small-model micro-tuning” to “near-cap dense frontier + export-side improvements.”

The next likely high-upside directions are:

  • export-side symmetry-aware permutation for compression (see the sketch after this list)
  • tensor sensitivity mapping / heterogeneous export allocation
  • building new export ideas on top of the deeper dense 14x576 control instead of the older 6.6 MB model
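For the permutation idea flagged above, a toy sketch of the mechanism; the similarity ordering is a placeholder for illustration, not a worked-out method:

```python
import numpy as np

# Hypothetical export-side permutation: reorder the rows of an int8
# weight matrix (storing the inverse permutation for reload) so that
# similar rows are adjacent, which can give zlib longer matches.
def permute_rows_for_zlib(q: np.ndarray):
    order = np.argsort(q.sum(axis=1, dtype=np.int64))  # crude similarity proxy
    inv = np.argsort(order)                            # needed to undo at load
    return q[order], inv
```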

Caveats

  • These are local experiments on 1x RTX 3090 under Windows.
  • They are not leaderboard claims and are not directly comparable to the official 8xH100 / 10 minute challenge runs.
  • Local evaluation is still a proxy, even though it is now much more reliable than the earlier wallclock track.
  • No tokenizer work has been explored yet.
  • No dataset/accounting tricks or external-compute loopholes are being used.

Why This Branch Exists

The goal of this branch is to build a reproducible local search loop that ranks ideas against a closer approximation of the real challenge objective, so the strongest branch can be taken to target compute once grant access is available.

@JusticeShultz changed the title from “[WIP] Compression-aware roundtrip-proxy research for Parameter Golf” to “[WIP] Compression-aware roundtrip-proxy research” on Mar 18, 2026
… new roundtrip sweep launchers

This update brings the local 3090 research branch up to date with the latest roundtrip-proxy findings.

Changes:
- updated `docs/research_tracks.md` with the completed recurrent/shared-block and sidecar sweep results
- marked the dense compression-aware baseline (`COMPRESSION_REG_WEIGHT=0.005`) as the current local leader at `final_int8_zlib_roundtrip_exact val_bpb 2.06085837`
- reprioritized the next pivot toward conservative low-bit / ternary shaping on top of the winning dense setup
- fixed Run Monitor stale wrap-up handling so incomplete logs no longer show bogus "over expected quantized validation" ETAs
- added dedicated sweep launchers for:
  - roundtrip sidecar tuning
  - roundtrip ternary / low-bit tuning

Current status:
- best local matched roundtrip result remains `2.06085837`
- sidecar revisit got very close (`2.06132482`) but did not beat the dense winner
- recurrent/shared-block variants were not competitive on the local roundtrip track
- ternary sweep is now the active next research pivot
…p dense frontier exploration

This update brings the local 3090 research branch up to date with the latest fixed-step roundtrip results and dense near-cap experiments.

Highlights:
- stabilized local methodology around fixed-step post-roundtrip evaluation instead of noisy wallclock-only ranking
- confirmed compression-aware training as the only clearly first-order win among the early experimental branches
- added export-aware compression regularization work, including grid-alignment and follow-up scale/outlier checks
- showed that most small-model micro-ideas (sidecar, ternary, sparse attention, recurrence, residual-budget tuning) do not currently beat the dense compression-aware control on the trusted local track
- ran an iso-byte dense sweep, which showed that simply spending more of the byte budget matters much more than small under-cap regularizer gains
- extended the dense frontier near the artifact cap and found a new local leader:
  - `14 layers / 576 dim / 8 heads / 4 KV heads`
  - `COMPRESSION_REG_WEIGHT=0.005`
  - `COMPRESSION_GRID_REG_WEIGHT=0.10`
  - fixed-step exact roundtrip `val_bpb=1.99806297`
  - total artifact `15,222,128` bytes

Current local takeaways:
- dense scaling near the byte cap is the dominant direction right now
- depth currently looks better than width in the near-cap regime tested so far
- the next likely high-upside branch is export-side work on top of the deeper dense control, not more small-model sidecar or low-bit sweeps

Also included:
- updated `docs/research_tracks.md`
- added/updated local sweep scripts for fixed-step export-aware, iso-byte, and high-cap dense experiments
- hardened parts of the local sweep process after finding launcher/harness issues during larger runs
@JusticeShultz changed the title from “[WIP] Compression-aware roundtrip-proxy research” to “[WIP] Fixed-step compression-aware research and dense near-cap frontier” on Mar 19, 2026
@JusticeShultz changed the title from “[WIP] Fixed-step compression-aware research and dense near-cap frontier” to “[WIP] Compression-aware fixed-step research” on Mar 19, 2026
@0hq closed this Mar 19, 2026
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
…penai#33)

The previous fix (PR openai#32) extracted the JSON correctly but then piped
the raw verdict string ('NOT YET') into $GITHUB_OUTPUT without a key,
which the runner rejects:

    Unable to process file command 'output' successfully.
    Invalid format 'NOT YET'

Fix: write 'verdict=<value>' instead. Also replace the space inside
the verdict ('GATE-2 PASS', 'NOT YET') with an underscore so the value
is a single token, since GITHUB_OUTPUT doesn't accept multi-word
unencoded values without the multiline EOF marker.

This output is informational only — the digest step reads from the
JSON file directly via jq, so the encoding change has no downstream
effect.

Refs openai#16.

Co-authored-by: Perplexity Computer <computer@perplexity.ai>