
Add semi-hidden feature to allow recording metrics at the end of each training epoch #540

Merged
drewoldag merged 1 commit into main from awo/emit-metrics-on-epoch-complete on Nov 20, 2025
Conversation

@drewoldag (Collaborator) commented Nov 20, 2025

This change adds a new event handler to the training engine that looks for a model method named log_epoch_metrics. If found, the method is called on the model at the end of each training epoch.

It is expected (though not enforced) that log_epoch_metrics will return a dictionary similar to what train_step returns. We will then log each key of the dictionary under training/training/epoch/<foo> in TensorBoard or training/epoch/<foo> in MLFlow.

The anticipated use case is that model developers will accumulate values in train_step over the course of an epoch, then use those values to calculate a metric at the end of the epoch before resetting the accumulators.
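The accumulate-then-compute pattern described above might look roughly like this (a hypothetical minimal sketch: the class, attribute names, and plain-float "loss" are illustrative, not Hyrax's actual model API):

```python
class ExampleModel:
    """Hypothetical model sketch showing the accumulate/compute/reset pattern."""

    def __init__(self):
        # Accumulators that build up over the course of an epoch.
        self._loss_sum = 0.0
        self._batches = 0

    def train_step(self, batch_loss):
        # Accumulate per-batch values; return the usual per-step metrics dict.
        self._loss_sum += batch_loss
        self._batches += 1
        return {"loss": batch_loss}

    def log_epoch_metrics(self):
        # Called by the training engine at the end of each epoch.
        # Compute the epoch-level metric, then reset the accumulators.
        mean_loss = self._loss_sum / max(self._batches, 1)
        self._loss_sum, self._batches = 0.0, 0
        return {"mean_loss": mean_loss}
```

Anything returned from log_epoch_metrics here would then appear under the training/epoch/ prefix described above.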

Below is a simple example after implementing log_epoch_metrics in HyraxAutoencoder so that it returns 3 "metrics" with random values over the course of training for 5 epochs.
[Screenshot, Nov 20 2025: TensorBoard plots of the three random-valued epoch metrics over 5 epochs]

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds a semi-hidden feature that allows model developers to record custom metrics at the end of each training epoch. The implementation adds a new event handler that checks for a log_epoch_metrics method on the model and logs the returned metrics to both TensorBoard and MLFlow.

Key changes:

  • Added log_epoch_metrics event handler triggered on HYRAX_EPOCH_COMPLETED
  • Automatically detects if a model implements log_epoch_metrics() method via hasattr check
  • Logs returned metrics to TensorBoard (training/training/epoch/{m}) and MLFlow (training/epoch/{m})
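The detect-and-log flow summarized above can be sketched as follows (a simplified, hypothetical standalone function: the real handler is wired to the ignite trainer via the custom HYRAX_EPOCH_COMPLETED event and calls tensorboardX and MLFlow directly, whereas here the two loggers are passed in as plain callables):

```python
def log_epoch_metrics_handler(model, epoch_number, tb_log, mlflow_log):
    """Sketch of the opt-in epoch-metrics handler described above.

    tb_log(name, value, step) and mlflow_log(metrics_dict, step) stand in
    for tensorboardX's add_scalar and MLFlow's log_metrics.
    """
    # Only act if the model opted in by defining log_epoch_metrics().
    if not hasattr(model, "log_epoch_metrics"):
        return
    epoch_metrics = model.log_epoch_metrics()
    # Defensive check (as suggested in review): tolerate None or wrong types.
    if not isinstance(epoch_metrics, dict):
        return
    for name, value in epoch_metrics.items():
        tb_log(f"training/training/epoch/{name}", value, epoch_number)
        mlflow_log({f"training/epoch/{name}": value}, epoch_number)
```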

Comment on lines +725 to +729
for m in epoch_metrics:
    tensorboardx_logger.add_scalar(
        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
    )
    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)

Copilot AI Nov 20, 2025


The code assumes model.log_epoch_metrics() returns a dictionary but doesn't verify this. If the method returns None, [], or a non-iterable value, the iteration on line 725 will raise a TypeError. Consider adding a check:

if hasattr(model, "log_epoch_metrics"):
    epoch_number = trainer.state.epoch
    epoch_metrics = model.log_epoch_metrics()
    if epoch_metrics:  # or: if isinstance(epoch_metrics, dict)
        for m in epoch_metrics:
            # ... logging code

This defensive check is especially important for a "semi-hidden feature" where users may not follow the expected interface strictly.

Suggested change

-for m in epoch_metrics:
-    tensorboardx_logger.add_scalar(
-        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
-    )
-    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)
+if isinstance(epoch_metrics, dict) and epoch_metrics:
+    for m in epoch_metrics:
+        tensorboardx_logger.add_scalar(
+            f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
+        )
+        mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)


codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.91%. Comparing base (fff7213) to head (f12220e).
⚠️ Report is 92 commits behind head on main.

Files with missing lines Patch % Lines
src/hyrax/pytorch_ignite.py 37.50% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
- Coverage   54.94%   54.91%   -0.03%     
==========================================
  Files          51       51              
  Lines        5007     5015       +8     
==========================================
+ Hits         2751     2754       +3     
- Misses       2256     2261       +5     

☔ View full report in Codecov by Sentry.


@dougbrn (Collaborator) left a comment


This looks good for a rapid turnaround on the applecider incubator. I think this should eventually evolve into something more tunable within train_step, so that train_step does not need to be modified in addition to defining this special method.

@github-actions

Before [fff7213] After [e52910d] Ratio Benchmark (Parameter)
1.58G 1.6G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
7.04±0.02s 7.18±0.04s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
390±2ms 396±1ms 1.02 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'qdrant')
1.83±0.03s 1.85±0.02s 1.01 benchmarks.time_database_connection_help
44.4±0.1ms 44.7±0.4ms 1.01 benchmarks.time_nb_obj_construct
1.81±0.02s 1.82±0.01s 1.01 benchmarks.time_save_to_database_help
1.83±0.01s 1.85±0.02s 1.01 benchmarks.time_train_help
1.83±0.02s 1.85±0.03s 1.01 benchmarks.time_umap_help
1.84±0.01s 1.86±0.03s 1.01 benchmarks.time_visualize_help
1.47±0.01s 1.48±0.01s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb')


@drewoldag drewoldag merged commit 0575720 into main Nov 20, 2025
14 of 16 checks passed
@drewoldag drewoldag deleted the awo/emit-metrics-on-epoch-complete branch November 20, 2025 21:26

3 participants