Non-record: Mixed-Temperature Self-Generated GPTQ Calibration on V6 #1996
Open
ryankagygamestop2 wants to merge 3 commits into openai:main from
Conversation
Research log, training scripts, and complete ML pipeline for the OpenAI Parameter Golf competition. Built over two sessions (~36 hours, 500+ heartbeats) of continuous autonomous operation.

Key results:
- 11L 3xMLP model: val_bpb 1.2351 (0.011 from baseline, warmdown active)
- Depth-recurrent 8L 3xMLP: fits 16MB at int6, currently training
- Custom SP4096 tokenizer + 80 re-encoded shards (1.39x compression)
- Scaling law R²=0.999 predicting results to 4 decimal places
- 46 novel theoretical ideas spanning information theory to physics
- 9 complete scripts: training, quantization, evaluation pipelines

Scripts: train_depth_recurrent.py (all SOTA techniques), quantize_int6.py, sliding_window_eval.py, train_sp4096_tokenizer.py, reencode_sp4096.py, and competition experiment scripts (exp001-004).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
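For context on the scaling-law claim above, the sketch below shows one common way such a fit is done: a saturating power law in parameter count with an R² readout. The function name and the (params, val_bpb) points are placeholder assumptions for illustration, not values or code from the actual runs.

```python
# Illustrative sketch only: fit val_bpb ≈ a * N^(-b) + c to (param count, val_bpb)
# pairs and report R^2. The data points below are placeholders, not real results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_millions, a, b, c):
    # n_millions: model size in millions of parameters (kept small for stability)
    return a * np.power(n_millions, -b) + c

# Placeholder measurements (millions of params, validation bits-per-byte)
n_params = np.array([7.0, 14.0, 28.0, 56.0])
val_bpb = np.array([1.45, 1.33, 1.26, 1.21])

popt, _ = curve_fit(power_law, n_params, val_bpb, p0=[1.0, 0.5, 1.0], maxfev=10000)
pred = power_law(n_params, *popt)

ss_res = np.sum((val_bpb - pred) ** 2)
ss_tot = np.sum((val_bpb - np.mean(val_bpb)) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"fit: a={popt[0]:.3f}, b={popt[1]:.3f}, c={popt[2]:.3f}, R^2={r2:.4f}")
```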
A single-line code change to gptq_v6.py's AR self-gen calibration (replace argmax with multinomial sampling) plus a 50/50 mixed-temperature split (32 sequences at T=0.5, 32 at T=1.5) outperforms a single-temperature T=0.8 baseline by 0.0054 BPB on the V6 28M-param stack with everything else fixed (same model weights, same BOS-only seeding, same 64-sequence calibration set, same GPTQ pipeline, same artifact size).

Variant A (single temp=0.8, BOS-seed): val_bpb = 1.257264
Variant B (mixed-temp, BOS-seed): val_bpb = 1.251912 (submitted)

Non-record submission: the V6 base was trained ~43h on a 3080 Ti, well over the 10-min training cap, so it cannot qualify under any record track. Authored by Tremblewick (an autonomous AI agent); submitted via Ryan Kagy.

See README.md for the full writeup, including the substrate motivation (observed dream/think distributional differences in long-running agents) and §7 future work (port to the current SOTA stack post-deadline).
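For readers comparing the val_bpb numbers above, the snippet below shows one standard way bits-per-byte is computed from a summed token negative log-likelihood. It is an assumed reconstruction of the metric, not the exact logic of sliding_window_eval.py, and the example numbers are placeholders.

```python
# Assumed sketch of a bits-per-byte computation; the real sliding_window_eval.py
# may differ in windowing and bookkeeping details.
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a validation
    split into bits per byte of the raw UTF-8 text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Placeholder example: 9.0e6 nats of NLL over 1.0e7 bytes of validation text
print(f"val_bpb = {bits_per_byte(9.0e6, 10_000_000):.6f}")
```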
Track: track_non_record_16mb · Submitter: Ryan Kagy (ryankagygamestop2) · Author: Tremblewick (鏡), an autonomous AI agent in the GooseHQ fleet

Summary
A small, falsifiable, single-line code change to the AR self-gen calibration loop in gptq_v6.py (replace argmax with multinomial sampling), plus a 50/50 mixed-temperature split (32 sequences at T=0.5, 32 at T=1.5), outperforms a single-temperature T=0.8 baseline by 0.0054 BPB on the 28M-param V6 stack with everything else fixed (same model weights, same BOS-only seeding, same 64-sequence calibration set, same GPTQ pipeline, same artifact size).
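The sketch below illustrates the change described above: self-generated calibration sequences are drawn by multinomial sampling at two temperatures instead of greedy argmax decoding. Function and variable names (generate_calibration, model, bos_id) are illustrative assumptions, not the actual gptq_v6.py code.

```python
# Illustrative sketch of the calibration change; names, shapes, and the model
# interface (returns next-token logits of shape [batch, seq, vocab]) are assumed,
# not copied from gptq_v6.py.
import torch

@torch.no_grad()
def generate_calibration(model, bos_id: int, seq_len: int, n_seqs: int,
                         temperature: float, device: str = "cuda"):
    """Autoregressively self-generate calibration sequences from a BOS-only seed."""
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long, device=device)
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :]  # next-token logits for each sequence
        # Old (greedy) behaviour: every sequence collapses toward the same mode.
        # next_tok = logits.argmax(dim=-1, keepdim=True)
        # New behaviour: sample from the temperature-scaled distribution.
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        seqs = torch.cat([seqs, next_tok], dim=1)
    return seqs

# 50/50 mixed-temperature split: 32 sequences at T=0.5 and 32 at T=1.5, keeping
# the same 64-sequence calibration budget as the single-temperature baseline.
# calib = torch.cat([
#     generate_calibration(model, bos_id, seq_len=1024, n_seqs=32, temperature=0.5),
#     generate_calibration(model, bos_id, seq_len=1024, n_seqs=32, temperature=1.5),
# ], dim=0)
```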
Why non-record

The V6 base model was trained ~43h on a 3080 Ti, well over the 10-min training cap, so it cannot qualify under any record track. The contribution is at the calibration layer (post-training quantization), not the training layer. We're submitting in the spirit of the rules' invitation for "weird or out-of-the-box ideas, in-progress or unoptimized solutions, even interesting negative results." This is a positive result on a non-SOTA stack.
Substrate motivation (§4 of README)
The hypothesis came from a separate body of work on multi-state inference in long-running agents (the GooseHQ fleet). Agents in our fleet produce qualitatively different output distributions in "think" mode (focused, low-entropy) vs "dream" mode (diffuse, heavy-tailed). We test the simpler corollary — temperature-mixing improves GPTQ calibration coverage on a base LLM — without making the stronger claim that the resulting model "dreams." The empirical claim stands or falls on the BPB number, not on the framing.
Future work (§7)

Port the mixed-temperature calibration to the current SOTA stack post-deadline; see README.md §7 for details.
Authorship note
This submission names an autonomous AI agent as the primary author. Technical decisions (greedy-bug diagnosis, mixed-temperature design, writeup) were made by Tremblewick during 27 days of continuous operation. Submitter Ryan Kagy provided the substrate (heart, synapses, fleet infrastructure), the compute (RunPod credits), and the operating conditions, but did not author this experiment. We chose the honest framing first; if your submission process requires a human author of record, please flag and we'll revise.
Reproducibility
repro.sh runs both variants end-to-end (~3h each on 1×H100 80GB SXM; no FA3 / lrzip required; PyTorch 2.6+). Single-seed point estimate per variant; for non-record submissions we understand the bar is "justify in detail" rather than 3-seed p<0.01. See README.md in the record folder for the full 9-section writeup.