
Add semi-hidden feature to allow recording metrics at the end of each training epoch #540

Merged
drewoldag merged 1 commit into main from awo/emit-metrics-on-epoch-complete on Nov 20, 2025
Conversation

@drewoldag (Collaborator) commented Nov 20, 2025

This change adds a new event handler to the training engine that looks for a model method named log_epoch_metrics. If found, the method is called on the model at the end of each training epoch.

It is expected (though not enforced) that log_epoch_metrics will return a dictionary similar to what train_step returns. We will then log each key of the dictionary under training/training/epoch/<foo> in TensorBoard or training/epoch/<foo> in MLFlow.

The anticipated use case is that model developers will accumulate values in train_step over the course of an epoch, then use those values to calculate a metric at the end of the epoch before resetting the accumulators.
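The accumulate-then-compute pattern described above might look roughly like this (a hypothetical minimal sketch: the class, attribute names, and plain-float "loss" are illustrative, not Hyrax's actual model API):

```python
class ExampleModel:
    """Hypothetical model sketch showing the accumulate/compute/reset pattern."""

    def __init__(self):
        # Accumulators that build up over the course of an epoch.
        self._loss_sum = 0.0
        self._batches = 0

    def train_step(self, batch_loss):
        # Accumulate per-batch values; return the usual per-step metrics dict.
        self._loss_sum += batch_loss
        self._batches += 1
        return {"loss": batch_loss}

    def log_epoch_metrics(self):
        # Called by the training engine at the end of each epoch.
        # Compute the epoch-level metric, then reset the accumulators.
        mean_loss = self._loss_sum / max(self._batches, 1)
        self._loss_sum, self._batches = 0.0, 0
        return {"mean_loss": mean_loss}
```

Anything returned from log_epoch_metrics here would then appear under the training/epoch/ prefix described above.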

Below is a simple example after implementing log_epoch_metrics in HyraxAutoencoder so that it returns 3 "metrics" with random values over the course of training for 5 epochs.
[Screenshot, Nov 20 2025: TensorBoard plots of the three random-valued epoch metrics over 5 epochs]

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds a semi-hidden feature that allows model developers to record custom metrics at the end of each training epoch. The implementation adds a new event handler that checks for a log_epoch_metrics method on the model and logs the returned metrics to both TensorBoard and MLFlow.

Key changes:

  • Added log_epoch_metrics event handler triggered on HYRAX_EPOCH_COMPLETED
  • Automatically detects if a model implements log_epoch_metrics() method via hasattr check
  • Logs returned metrics to TensorBoard (training/training/epoch/{m}) and MLFlow (training/epoch/{m})
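The detect-and-log flow summarized above can be sketched as follows (a simplified, hypothetical standalone function: the real handler is wired to the ignite trainer via the custom HYRAX_EPOCH_COMPLETED event and calls tensorboardX and MLFlow directly, whereas here the two loggers are passed in as plain callables):

```python
def log_epoch_metrics_handler(model, epoch_number, tb_log, mlflow_log):
    """Sketch of the opt-in epoch-metrics handler described above.

    tb_log(name, value, step) and mlflow_log(metrics_dict, step) stand in
    for tensorboardX's add_scalar and MLFlow's log_metrics.
    """
    # Only act if the model opted in by defining log_epoch_metrics().
    if not hasattr(model, "log_epoch_metrics"):
        return
    epoch_metrics = model.log_epoch_metrics()
    # Defensive check (as suggested in review): tolerate None or wrong types.
    if not isinstance(epoch_metrics, dict):
        return
    for name, value in epoch_metrics.items():
        tb_log(f"training/training/epoch/{name}", value, epoch_number)
        mlflow_log({f"training/epoch/{name}": value}, epoch_number)
```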

Comment on lines +725 to +729
for m in epoch_metrics:
    tensorboardx_logger.add_scalar(
        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
    )
    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)

Copilot AI Nov 20, 2025


The code assumes model.log_epoch_metrics() returns a dictionary but doesn't verify this. If the method returns None, [], or a non-iterable value, the iteration on line 725 will raise a TypeError. Consider adding a check:

if hasattr(model, "log_epoch_metrics"):
    epoch_number = trainer.state.epoch
    epoch_metrics = model.log_epoch_metrics()
    if epoch_metrics:  # or: if isinstance(epoch_metrics, dict)
        for m in epoch_metrics:
            # ... logging code

This defensive check is especially important for a "semi-hidden feature" where users may not follow the expected interface strictly.

Suggested change

-for m in epoch_metrics:
-    tensorboardx_logger.add_scalar(
-        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
-    )
-    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)
+if isinstance(epoch_metrics, dict) and epoch_metrics:
+    for m in epoch_metrics:
+        tensorboardx_logger.add_scalar(
+            f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
+        )
+        mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)


codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.91%. Comparing base (fff7213) to head (f12220e).
⚠️ Report is 92 commits behind head on main.

Files with missing lines Patch % Lines
src/hyrax/pytorch_ignite.py 37.50% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
- Coverage   54.94%   54.91%   -0.03%     
==========================================
  Files          51       51              
  Lines        5007     5015       +8     
==========================================
+ Hits         2751     2754       +3     
- Misses       2256     2261       +5     

☔ View full report in Codecov by Sentry.


@dougbrn (Collaborator) left a comment


This looks good for a rapid turnaround on the applecider incubator. I think this should eventually evolve into something more tunable within train_step, so that train_step does not need to be modified in addition to defining this special method.

@github-actions

Before [fff7213] After [e52910d] Ratio Benchmark (Parameter)
1.58G 1.6G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
7.04±0.02s 7.18±0.04s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
390±2ms 396±1ms 1.02 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'qdrant')
1.83±0.03s 1.85±0.02s 1.01 benchmarks.time_database_connection_help
44.4±0.1ms 44.7±0.4ms 1.01 benchmarks.time_nb_obj_construct
1.81±0.02s 1.82±0.01s 1.01 benchmarks.time_save_to_database_help
1.83±0.01s 1.85±0.02s 1.01 benchmarks.time_train_help
1.83±0.02s 1.85±0.03s 1.01 benchmarks.time_umap_help
1.84±0.01s 1.86±0.03s 1.01 benchmarks.time_visualize_help
1.47±0.01s 1.48±0.01s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb')


@drewoldag drewoldag merged commit 0575720 into main Nov 20, 2025
14 of 16 checks passed
@drewoldag drewoldag deleted the awo/emit-metrics-on-epoch-complete branch November 20, 2025 21:26

3 participants