Commit 04d35ed
committed
fix(submission): use canonical byte-count exporter in prepare_caseops_data.py
The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.
Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.
Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).1 parent 8e4bea2 commit 04d35ed
3 files changed
Lines changed: 40 additions & 82 deletions
File tree
- records/track_10min_16mb
- 2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12
- 2026-04-24_PR1787Base_Smear_LQERAsym_PhasedTTT_1.06157
Lines changed: 17 additions & 38 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
54 | 54 | | |
55 | 55 | | |
56 | 56 | | |
57 | | - | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
58 | 62 | | |
59 | 63 | | |
60 | 64 | | |
| |||
92 | 96 | | |
93 | 97 | | |
94 | 98 | | |
95 | | - | |
96 | | - | |
97 | | - | |
98 | | - | |
99 | | - | |
100 | | - | |
101 | | - | |
102 | | - | |
103 | | - | |
104 | | - | |
105 | | - | |
106 | | - | |
107 | | - | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
108 | 105 | | |
109 | | - | |
110 | | - | |
111 | | - | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | | - | |
130 | | - | |
131 | | - | |
132 | | - | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
133 | 112 | | |
134 | 113 | | |
135 | 114 | | |
| |||
Lines changed: 6 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
32 | | - | |
33 | | - | |
| 32 | + | |
| 33 | + | |
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
39 | 39 | | |
40 | 40 | | |
41 | | - | |
42 | | - | |
| 41 | + | |
| 42 | + | |
43 | 43 | | |
44 | 44 | | |
45 | 45 | | |
| |||
84 | 84 | | |
85 | 85 | | |
86 | 86 | | |
87 | | - | |
| 87 | + | |
88 | 88 | | |
89 | 89 | | |
90 | 90 | | |
| |||
93 | 93 | | |
94 | 94 | | |
95 | 95 | | |
96 | | - | |
| 96 | + | |
97 | 97 | | |
98 | 98 | | |
99 | 99 | | |
| |||
Lines changed: 17 additions & 38 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
54 | 54 | | |
55 | 55 | | |
56 | 56 | | |
57 | | - | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
58 | 62 | | |
59 | 63 | | |
60 | 64 | | |
| |||
92 | 96 | | |
93 | 97 | | |
94 | 98 | | |
95 | | - | |
96 | | - | |
97 | | - | |
98 | | - | |
99 | | - | |
100 | | - | |
101 | | - | |
102 | | - | |
103 | | - | |
104 | | - | |
105 | | - | |
106 | | - | |
107 | | - | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
108 | 105 | | |
109 | | - | |
110 | | - | |
111 | | - | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | | - | |
130 | | - | |
131 | | - | |
132 | | - | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
133 | 112 | | |
134 | 113 | | |
135 | 114 | | |
| |||
0 commit comments