record 2025-08-28; medium track; two more value-embeddings#119
ClassicLarry merged 2 commits into KellerJordan:master
Conversation
Copying in the README for convenience:

Record 28th of August, 2025

Statistics about the measured results for the baseline and the updated run. Full code here.

Baseline

Changelog:
Final val_losses - baseline

Here's the list of final validation losses over 28 runs: [2.919321, 2.919348, 2.920103, 2.920455, 2.91936, 2.919165, 2.920336, 2.919816, 2.919747, 2.918579, 2.920533, 2.920729, 2.918076, 2.919458, 2.920759, 2.919738, 2.92073, 2.919327, 2.919639, 2.91942, 2.920585, 2.920464, 2.918828, 2.919279, 2.920514, 2.919351, 2.918162, 2.920809]

Here are some simple statistics:
Here are the t-test results: {
"n": 28,
"sample_mean": 2.9197368214285713,
"sample_std": 0.0007835970267072948,
"t_stat": -1.7772018695052398,
"p_value": 0.04340171700038697,
"alpha": 0.05,
  "decision": "REJECT H0 (mean < threshold)",
"upper_conf_bound_mean": 2.9199890544627403,
"threshold": 2.92
}

Run times - baseline

Here are the raw run times over 28 runs: ['1450.149', '1447.753', '1447.248', '1448.042', '1446.999', '1447.910', '1447.621', '1447.163', '1448.034', '1448.266', '1448.380', '1447.248', '1448.169', '1451.810', '1448.287', '1449.739', '1449.761', '1453.234', '1449.403', '1450.164', '1448.897', '1450.096', '1449.720', '1449.535', '1449.472', '1448.813', '1450.895', '1450.256']

And here are some simple statistics about the run times:
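The decision rule in the reports above can be reproduced with a short stdlib-only sketch. The function name `t_test_vs_threshold` is mine, and the critical value passed in as `t_crit` is assumed to come from a Student-t table (computing the exact p-value would need a t-distribution CDF such as `scipy.stats.t.cdf`, which is skipped here to stay dependency-free):

```python
import math
import statistics

def t_test_vs_threshold(samples, threshold, t_crit):
    """One-sample, one-sided t-test of H0: mean >= threshold (H1: mean < threshold).

    Returns a dict mirroring the JSON reports above. `t_crit` is the
    one-sided Student-t critical value for n - 1 degrees of freedom,
    taken from a table.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)     # Bessel-corrected (ddof=1), matching the reports
    se = std / math.sqrt(n)
    t_stat = (mean - threshold) / se
    return {
        "n": n,
        "sample_mean": mean,
        "sample_std": std,
        "t_stat": t_stat,
        "decision": "REJECT H0 (mean < threshold)" if t_stat < -t_crit
                    else "FAIL TO REJECT H0",
        "upper_conf_bound_mean": mean + t_crit * se,
        "threshold": threshold,
    }
```

Feeding it the 28 baseline losses with threshold 2.92 and `t_crit = 1.7033` (the one-sided 95% critical value for 27 degrees of freedom) reproduces the t-statistic of about -1.777 and the upper confidence bound of about 2.91999 shown above.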
Two new value embeddings

The actual record. Changelog:
Final val_losses - record

The raw final validation losses over 37 runs: [2.919612, 2.919458, 2.918941, 2.917664, 2.91856, 2.919706, 2.919218, 2.918082, 2.919345, 2.920486, 2.919293, 2.917286, 2.921162, 2.919861, 2.917587, 2.919488, 2.919955, 2.919172, 2.919245, 2.918839, 2.918381, 2.919301, 2.917944, 2.919178, 2.918395, 2.920141, 2.918754, 2.918432, 2.919958, 2.91978, 2.919916, 2.919711, 2.918025, 2.919342, 2.920571, 2.917387, 2.919093]

Simple statistics:
T-test results: {
"n": 37,
"sample_mean": 2.919115378378378,
"sample_std": 0.000915598388163916,
"t_stat": -5.876968901489202,
"p_value": 5.07368129288152e-07,
"alpha": 0.05,
  "decision": "REJECT H0 (mean < threshold)",
"upper_conf_bound_mean": 2.9193695067707086,
"threshold": 2.92
}

Run times - record

Raw run times: ['1421.024', '1420.776', '1422.277', '1422.077', '1422.587', '1421.731', '1421.276', '1421.190', '1421.335', '1421.321', '1421.373', '1430.659', '1424.760', '1423.293', '1421.603', '1422.789', '1422.489', '1455.587', '1421.598', '1424.514', '1425.991', '1423.341', '1444.257', '1465.063', '1428.880', '1430.782', '1435.003', '1426.705', '1423.921', '1424.339', '1423.867', '1423.950', '1424.241', '1467.321', '1424.330', '1424.331', '1424.449']

Simple statistics: Mean: 1427.704 ± 11.387
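The "Mean: 1427.704 ± 11.387" line can be checked the same way. A minimal stdlib sketch, assuming the ± term is the population standard deviation (ddof=0), which is what reproduces the reported figure:

```python
import statistics

def runtime_summary(times_s):
    """Mean and population standard deviation (ddof=0) of run times, in seconds.

    The population std is assumed here because it matches the "± 11.387"
    figure reported above; swap in statistics.stdev for the
    Bessel-corrected version.
    """
    return statistics.mean(times_s), statistics.pstdev(times_s)
```

Applied to the 37 record run times this gives roughly 1427.7 ± 11.4 seconds, i.e. about 23.8 minutes per run, consistent with the mean time quoted in the record summary.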
IIRC baseline has 5960 steps?
Will compiled autograd cause the flexattention error again in the version of PyTorch you use?
Oh, you're right, sorry. I mistyped the number of steps for the baseline; I actually did use 5960 steps for the baseline, see here. The 5820 steps that I changed it to are correct, though; see here (that's the file I ran so often). Apologies for the oversight, no idea how that happened.
Yes, it did so for at least a few weeks (and I always run
Oh, I will try to reproduce the error and open an issue in the PyTorch repo.
New Medium Track record.
Changelog to baseline:
- Removed the `torch._dynamo.config.compiled_autograd = True` flag (it caused flexattention errors)
- Removed the `_patched_trace_structured` function (it often causes errors)
- Mean time: 24.15 minutes
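For reference, the compiled-autograd switch mentioned in the changelog is a real `torch._dynamo` config flag; a minimal fragment (assuming a PyTorch install) showing it left disabled:

```python
import torch._dynamo

# The flag the changelog drops: setting it to True triggered
# flexattention errors on the nightly build used for this record.
torch._dynamo.config.compiled_autograd = False
```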
Changelog to record:
Mean time: 23.8 minutes
PyTorch version: `torch==2.9.0.dev20250824+cu126`

For details, see the README.md that comes with the record, or this link.