
record 2025-08-28; medium track; two more value-embeddings #119

Merged
ClassicLarry merged 2 commits into KellerJordan:master from snimu:valemb
Dec 2, 2025

Conversation

@snimu
Contributor

@snimu snimu commented Aug 28, 2025

New Medium Track record.

Changelog to baseline:

  • Removed the torch._dynamo.config.compiled_autograd = True flag (it caused flexattention errors)
  • Removed the _patched_trace_structured function (it often causes errors)

Mean time: 24.15 minutes

Changelog to record:

  • Added two more value embeddings
    • Previously, the three value embeddings were applied to the layers: 0&13, 1&14, 2&15
    • Now, the five value embeddings are applied to the layers: 0&11, 1&12, 2&13, 3&14, 4&15
  • Reduced step count from 5890 to 5820

Mean time: 23.8 minutes

PyTorch version: torch==2.9.0.dev20250824+cu126

For details, see the README.md that comes with the record, or this link.

@snimu
Contributor Author

snimu commented Aug 28, 2025

Copying in the README for convenience:

Record 28th of August, 2025

Statistics about the measured results for the baseline and the updated run. Full code here.

Baseline

Changelog:

  • Removed the torch._dynamo.config.compiled_autograd = True flag (it caused flexattention errors)
  • Removed the _patched_trace_structured function (it often causes errors)

Final val_losses - baseline

Here's the list of final validation losses over 28 runs:

[2.919321, 2.919348, 2.920103, 2.920455, 2.91936, 2.919165, 2.920336, 2.919816, 2.919747, 2.918579, 2.920533, 2.920729, 2.918076, 2.919458, 2.920759, 2.919738, 2.92073, 2.919327, 2.919639, 2.91942, 2.920585, 2.920464, 2.918828, 2.919279, 2.920514, 2.919351, 2.918162, 2.920809]

Here are some simple statistics:

  • Mean: 2.920 ± 0.001
  • Median: 2.920
  • Min: 2.918
  • Max: 2.921

Here are the t-test results:

{
    "n": 28,
    "sample_mean": 2.9197368214285713,
    "sample_std": 0.0007835970267072948,
    "t_stat": -1.7772018695052398,
    "p_value": 0.04340171700038697,
    "alpha": 0.05,
    "decision": REJECT H0 (mean < threshold),
    "upper_conf_bound_mean": 2.9199890544627403,
    "threshold": 2.92
}
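
A minimal sketch of how such a one-sided t-test can be reproduced (assuming NumPy and SciPy; this is not necessarily the script used for the record). H0 is mean >= threshold, H1 is mean < threshold:

import numpy as np
from scipy import stats

# First three of the 28 baseline losses; paste the full list above to
# reproduce the exact numbers reported.
losses = np.array([2.919321, 2.919348, 2.920103])
threshold, alpha = 2.92, 0.05

# One-sided one-sample t-test: H1 is that the true mean is below the threshold.
t_stat, p_value = stats.ttest_1samp(losses, threshold, alternative="less")
n, mean, std = len(losses), losses.mean(), losses.std(ddof=1)
# One-sided (1 - alpha) upper confidence bound on the mean.
upper = mean + stats.t.ppf(1 - alpha, df=n - 1) * std / np.sqrt(n)

decision = "REJECT H0 (mean < threshold)" if p_value < alpha else "FAIL TO REJECT H0"
print({"n": n, "sample_mean": mean, "sample_std": std, "t_stat": t_stat,
       "p_value": p_value, "alpha": alpha, "decision": decision,
       "upper_conf_bound_mean": upper, "threshold": threshold})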

Run times - baseline

Here are the raw run times over 28 runs:

['1450.149', '1447.753', '1447.248', '1448.042', '1446.999', '1447.910', '1447.621', '1447.163', '1448.034', '1448.266', '1448.380', '1447.248', '1448.169', '1451.810', '1448.287', '1449.739', '1449.761', '1453.234', '1449.403', '1450.164', '1448.897', '1450.096', '1449.720', '1449.535', '1449.472', '1448.813', '1450.895', '1450.256']

And here are some simple statistics about the run times:

  • Mean: 1449.038 ± 1.456
  • Median: 1448.855
  • Min: 1446.999
  • Max: 1453.234

Two new value embeddings

The actual record.

Changelog:

  • Added two more value embeddings (the pairing is sketched after this changelog)
    • Previously, the three value embeddings were applied to the layers: 0&13, 1&14, 2&15
    • Now, the five value embeddings are applied to the layers: 0&11, 1&12, 2&13, 3&14, 4&15
  • Reduced step count from 5890 to 5820
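
To make the pairing concrete, here is a minimal sketch (a hypothetical helper, not the record's actual code) of the pattern, assuming the 16-layer model the indices above imply: value embedding i is shared by early layer i and late layer n_layers - n_embeds + i.

def value_embed_layers(n_layers: int, n_embeds: int) -> list[tuple[int, int]]:
    """Pair value embedding i with early layer i and late layer n_layers - n_embeds + i."""
    return [(i, n_layers - n_embeds + i) for i in range(n_embeds)]

# Previous record: three embeddings -> layers 0&13, 1&14, 2&15
assert value_embed_layers(16, 3) == [(0, 13), (1, 14), (2, 15)]
# This record: five embeddings -> layers 0&11, 1&12, 2&13, 3&14, 4&15
assert value_embed_layers(16, 5) == [(0, 11), (1, 12), (2, 13), (3, 14), (4, 15)]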

Final val_losses - record

The raw final validation losses over 37 runs:

[2.919612, 2.919458, 2.918941, 2.917664, 2.91856, 2.919706, 2.919218, 2.918082, 2.919345, 2.920486, 2.919293, 2.917286, 2.921162, 2.919861, 2.917587, 2.919488, 2.919955, 2.919172, 2.919245, 2.918839, 2.918381, 2.919301, 2.917944, 2.919178, 2.918395, 2.920141, 2.918754, 2.918432, 2.919958, 2.91978, 2.919916, 2.919711, 2.918025, 2.919342, 2.920571, 2.917387, 2.919093]

Simple statistics:

  • Mean: 2.919 ± 0.001
  • Median: 2.919
  • Min: 2.917
  • Max: 2.921

T-test results:

{
    "n": 37,
    "sample_mean": 2.919115378378378,
    "sample_std": 0.000915598388163916,
    "t_stat": -5.876968901489202,
    "p_value": 5.07368129288152e-07,
    "alpha": 0.05,
    "decision": REJECT H0 (mean < threshold),
    "upper_conf_bound_mean": 2.9193695067707086,
    "threshold": 2.92
}

Run times - record

Raw run times:

['1421.024', '1420.776', '1422.277', '1422.077', '1422.587', '1421.731', '1421.276', '1421.190', '1421.335', '1421.321', '1421.373', '1430.659', '1424.760', '1423.293', '1421.603', '1422.789', '1422.489', '1455.587', '1421.598', '1424.514', '1425.991', '1423.341', '1444.257', '1465.063', '1428.880', '1430.782', '1435.003', '1426.705', '1423.921', '1424.339', '1423.867', '1423.950', '1424.241', '1467.321', '1424.330', '1424.331', '1424.449']

Simple statistics:

  • Mean: 1427.704 ± 11.387
  • Median: 1423.921
  • Min: 1420.776
  • Max: 1467.321

@YouJiacheng
Contributor

IIRC baseline has 5960 steps?
What change makes it 5890?
bos_align?

@YouJiacheng
Contributor

compiled autograd causes flexattention errors again in the version of pytorch you use?
interesting.

@snimu
Contributor Author

snimu commented Sep 4, 2025

IIRC baseline has 5960 steps? What change makes it 5890? bos_align?

Oh, you're right, sorry. I mistyped the baseline step count; I actually did use 5960 steps for the baseline, see here. The 5820 steps that I changed it to are correct, though; see here (that's the file I ran so often).

Apologies for the oversight, no idea how that happened.

@snimu
Contributor Author

snimu commented Sep 4, 2025

compiled autograd causes flexattention errors again in the version of pytorch you use? interesting.

Yes, it did so for at least a few weeks (and I always run pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade, so this holds across at least a few nightly releases of 2.9.0). I unfortunately know very little about the internals of torch.compile and flexattention, so removing the flag was the easiest solution.

@YouJiacheng
Contributor

oh, I will try to reproduce the error and open an issue in the pytorch repo.
