
A Tiered Residual Predictor for Parameter Golf #2

@travis-mcdevitt


Abstract

Parameter Golf rewards end-to-end compression quality under a 16 MB artifact cap, a 10-minute training budget on 8×H100s, and BPB evaluation on FineWeb validation. The current legal leaderboard leader is a heavily optimized 11-layer, 512d transformer stack with BigramHash, all-layer XSA, partial RoPE, Full Hessian GPTQ int6, and LZMA, scoring 1.1147 BPB at about 15.9 MB.

This paper proposes a different hypothesis: training efficiency, not artifact size alone, is now the tighter bottleneck, and the next gain may come from splitting prediction into tiers rather than continuing to enlarge or polish a monolithic tiny transformer.

Hypothesis

A fixed-predictor submission can improve BPB per training second by decomposing next-token prediction into three stages:

  1. a Tier 0 structural prior for cheap formatting and token-class regularities,
  2. a Tier 1 local lexical prior similar to, but potentially stronger than, BigramHash, and
  3. a Tier 2 neural residual model trained primarily on what the first two tiers miss.

The core claim is that the transformer should not spend the scarce 600-second training budget rediscovering easy local web-text regularities that cheaper tiers can absorb quickly. FineWeb-style web data contains abundant short-range structure, and the current frontier already implicitly exploits this through components like BigramHash and sliding-window evaluation.

Motivation

The present leaderboard suggests that participants have already extracted strong gains from the same broad family of ideas: local priors, widened MLPs, better quantization, and better evaluation-time context use. The current SOTA explicitly widened BigramHash to 3072×112 because that setting still fit under the 16 MB cap, and it dropped test-time training (TTT) after many negative attempts on the stronger GPTQ stack. This suggests two things:

  1. Local priors are genuinely useful.
  2. Once artifact bytes are controlled, not all added complexity converts into better BPB within the wall-clock budget.

Proposed Architecture

Tier 0 is a near-free prior built from token classes and simple structural features: leading-space behavior, punctuation continuation, boundary conditions, quote and bracket patterns, and similar high-frequency regularities.
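
As a rough illustration, a minimal Tier 0 sketch might precompute vocabulary masks for a few token classes and emit a logit bias from the previous token's surface form alone. The class rules and the `vocab_strs` input below are illustrative assumptions, not the proposal's exact feature set:

```python
import torch

def build_class_masks(vocab_strs):
    """Precompute boolean masks over the vocab for a few cheap token classes.

    vocab_strs: list of decoded token strings, indexed by token id (assumed input).
    """
    leading_space = torch.tensor([s.startswith(" ") for s in vocab_strs])
    punctuation = torch.tensor([s.strip() in {".", ",", "!", "?", ";", ":"} for s in vocab_strs])
    return {"leading_space": leading_space, "punctuation": punctuation}

def tier0_logit_bias(prev_token_str, masks, vocab_size, scale=1.0):
    """Structural logit bias computed only from the previous token's text."""
    bias = torch.zeros(vocab_size)
    # After sentence-final punctuation, the next token usually begins with a space.
    if prev_token_str.strip() in {".", "!", "?"}:
        bias[masks["leading_space"]] += scale
    # Punctuation is rarely followed immediately by more punctuation.
    if prev_token_str.strip() in {".", ",", "!", "?", ";", ":"}:
        bias[masks["punctuation"]] -= scale
    return bias
```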

Tier 1 is a learned local lexical prior, such as a hashed bigram or short-context lexical table, that captures common token-to-token transitions efficiently.
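
A minimal Tier 1 sketch, assuming a hashed-bigram formulation; the bucket count and hash constant below are placeholders rather than the leaderboard's BigramHash settings:

```python
import torch
import torch.nn as nn

class HashedBigramPrior(nn.Module):
    """Tier 1 sketch: map the previous token id to a hashed bucket of vocab logits."""

    def __init__(self, vocab_size, num_buckets=4096):
        super().__init__()
        self.num_buckets = num_buckets
        # One learned row of additive logits per hash bucket.
        self.table = nn.Parameter(torch.zeros(num_buckets, vocab_size))

    def forward(self, prev_ids):
        # prev_ids: (batch, seq) int64 token ids at the previous position.
        buckets = (prev_ids * 2654435761) % self.num_buckets  # simple multiplicative hash
        return self.table[buckets]  # (batch, seq, vocab_size) additive logit bias
```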

Tier 2 is a smaller transformer trained on top of the first two tiers, with combined logits of the form:

ℓ_final = ℓ_0 + ℓ_1 + ℓ_2

where ℓ_0 is structural, ℓ_1 is local lexical, and ℓ_2 is the neural residual. The design goal is not merely to "help" the transformer, but to change its job: it should focus on longer-range, semantic, and corrective behavior.
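
Put together, the combination could look like the sketch below, where `tier0`, `tier1`, and `tier2` are assumed to be modules mapping input ids to (batch, seq, vocab) logits (the string-based Tier 0 sketch above would need a small wrapper, and `tier2` would be a small causal transformer). It simply realizes ℓ_final = ℓ_0 + ℓ_1 + ℓ_2 in logit space:

```python
import torch
import torch.nn as nn

class TieredPredictor(nn.Module):
    """Additive tiered predictor: l_final = l_0 + l_1 + l_2 in logit space."""

    def __init__(self, tier0, tier1, tier2):
        super().__init__()
        self.tier0 = tier0  # structural prior (cheap, often frozen)
        self.tier1 = tier1  # local lexical prior, e.g. a hashed bigram table
        self.tier2 = tier2  # residual transformer

    def forward(self, input_ids):
        l0 = self.tier0(input_ids)  # (batch, seq, vocab)
        l1 = self.tier1(input_ids)
        l2 = self.tier2(input_ids)
        # The sum is taken before the softmax, so the cross-entropy gradient
        # reaching tier2 is driven by the combined model's remaining error:
        # tier2 trains as a residual corrector for what tiers 0 and 1 miss.
        return l0 + l1 + l2
```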

Why This May Work

The current leaderboard stack already demonstrates that cheap local priors can buy real BPB improvements. BigramHash is present in the leading legal submission, and Issue openai#140 treats BigramHash plus related local-bias components as part of the mainstream high-performing recipe.

The hypothesis here is stronger: those components should be treated not as side aids but as primary early predictors that saturate quickly, allowing the neural core to become smaller, faster, or more artifact-efficient.

This is attractive because each tier should converge at a different rate:

  • Tier 0 should require little or no SGD
  • Tier 1 should learn quickly
  • Tier 2 is slower, but it now trains on a cleaner residual problem

If true, this shifts the main training burden toward the genuinely hard part of the distribution.

Legality and Evaluation

This proposal is designed for the current Track A fixed-predictor interpretation in the field guide (openai#1017): the model is trained, then evaluated without updating useful model state from validation tokens. Sliding-window attention patterns, sequence-length changes, and other inference optimizations are allowed, but eval-built caches, TTT, and adaptive mixing from eval-derived statistics are not. Evaluation must remain a single left-to-right pass.

A tiered fixed predictor fits that regime naturally.
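
For reference, a fixed-predictor evaluation loop under these rules might look like the following sketch: a single left-to-right pass over the validation tokens in non-overlapping chunks (sliding-window variants are also legal), with nothing ever written back from validation data. The `bytes_per_token` lookup and the chunking details are assumptions for illustration, not the competition's exact harness:

```python
import math
import torch

@torch.no_grad()
def bits_per_byte(model, token_ids, bytes_per_token, context_len=1024):
    """Single left-to-right pass; no model state is updated from validation tokens.

    token_ids: 1-D long tensor of validation tokens.
    bytes_per_token: 1-D tensor giving the UTF-8 byte length of each vocab entry.
    """
    model.eval()
    total_nll_nats, total_bytes = 0.0, 0
    for start in range(0, token_ids.numel() - 1, context_len):
        chunk = token_ids[start:start + context_len + 1]
        if chunk.numel() < 2:
            break
        inputs, targets = chunk[:-1], chunk[1:]
        logits = model(inputs.unsqueeze(0))                      # (1, seq, vocab)
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        total_nll_nats += -logprobs[0, torch.arange(targets.numel()), targets].sum().item()
        total_bytes += int(bytes_per_token[targets].sum().item())
    # Convert total nats to bits and normalize by the byte count of the targets.
    return total_nll_nats / math.log(2) / total_bytes
```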

Training Plan

  • Tier 0 can be fit from counts or very light calibration, not full neural training
  • Tier 1 can be trained briefly either alone or jointly, then frozen or partially frozen
  • Tier 2 can then be trained on the combined predictor
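
A minimal staged schedule consistent with this plan might look like the sketch below. The split of the 600-second budget, the optimizer, and the learning rate are illustrative assumptions; Tier 0 is assumed to be parameter-free or already fit from counts before this loop runs:

```python
import time
import torch
import torch.nn.functional as F

def train_tiered(predictor, data_loader, total_seconds=600.0, tier1_seconds=60.0):
    """Train Tier 1 jointly for a short window, freeze it, then spend the
    remaining wall-clock budget on the Tier 2 residual transformer."""
    opt = torch.optim.AdamW(predictor.parameters(), lr=3e-4)
    start = time.time()
    tier1_frozen = False
    for inputs, targets in data_loader:
        elapsed = time.time() - start
        if elapsed > total_seconds:
            break
        if not tier1_frozen and elapsed > tier1_seconds:
            for p in predictor.tier1.parameters():
                p.requires_grad_(False)   # freeze the local lexical prior
            tier1_frozen = True
        logits = predictor(inputs)        # combined l0 + l1 + l2 logits
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return predictor
```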

Compression-aware shaping remains a late-stage step, consistent with current successful stacks that apply quantization and export-oriented techniques after strong base training. The current SOTA itself uses late QAT, GPTQ, and LZMA as part of the final artifact path rather than as the sole source of gains.

Falsifiable Predictions

This hypothesis predicts three measurable outcomes:

  1. A model with Tiers 0 and 1 plus a smaller transformer should beat a plain transformer of similar training cost in loss-vs-time, not just final loss.
  2. Tiers 0 and 1 should show meaningful standalone gain early, proving they absorb easy structure rather than merely duplicating the neural core.
  3. The combined model should permit later compression shaping without needing a much larger transformer to reach similar BPB.

Conclusion

The central bet is that Parameter Golf's current frontier is bottlenecked less by "how many parameters fit" and more by "how much useful prediction can be learned in 600 seconds." The leading stack already hints that local priors matter. This paper proposes taking that logic seriously: build a predictor whose cheapest tiers own the easy part of web-text compression, and train the transformer as a residual model on top. If that decomposition works, it offers a path that is adjacent to the leaderboard's lessons without merely becoming "one more tweak" of the same tiny-transformer recipe.


Relates to: #1 (context extension + MDL exploration on claude/explore-alternative-solutions-uLbe5)
