Conversation
Pull Request Overview
This PR adds a semi-hidden feature that allows model developers to record custom metrics at the end of each training epoch. The implementation adds a new event handler that checks for a log_epoch_metrics method on the model and logs the returned metrics to both TensorBoard and MLFlow.
Key changes:
- Added a `log_epoch_metrics` event handler triggered on `HYRAX_EPOCH_COMPLETED`
- Automatically detects whether a model implements a `log_epoch_metrics()` method via a `hasattr` check
- Logs returned metrics to TensorBoard (`training/training/epoch/{m}`) and MLFlow (`training/epoch/{m}`)
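As a rough sketch of how a model might opt into this feature, the class below defines the `log_epoch_metrics()` method the handler looks for. The class and metric names are hypothetical; only the method name and the dict return convention come from this PR.

```python
class ToyModel:
    """Hypothetical model opting into the semi-hidden feature by
    defining log_epoch_metrics(); names here are illustrative only."""

    def __init__(self):
        self.loss_sum = 0.0
        self.batch_count = 0

    def train_step(self, batch_loss):
        # Accumulate values over the epoch, as a real train_step might.
        self.loss_sum += batch_loss
        self.batch_count += 1

    def log_epoch_metrics(self):
        # Return a dict of scalar metrics; the handler logs each key.
        metrics = {"mean_loss": self.loss_sum / max(self.batch_count, 1)}
        # Reset accumulators for the next epoch.
        self.loss_sum, self.batch_count = 0.0, 0
        return metrics


model = ToyModel()
for loss in (0.5, 0.25, 0.75):
    model.train_step(loss)
print(model.log_epoch_metrics())  # {'mean_loss': 0.5}
```

Because the method is detected with `hasattr`, models that do not define it are unaffected.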
```python
for m in epoch_metrics:
    tensorboardx_logger.add_scalar(
        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
    )
    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)
```
The code assumes `model.log_epoch_metrics()` returns a dictionary but doesn't verify this. If the method returns `None`, `[]`, or a non-iterable value, the iteration on line 725 will raise a `TypeError`. Consider adding a check:
```python
if hasattr(model, "log_epoch_metrics"):
    epoch_number = trainer.state.epoch
    epoch_metrics = model.log_epoch_metrics()
    if epoch_metrics:  # or: if isinstance(epoch_metrics, dict)
        for m in epoch_metrics:
            # ... logging code
```

This defensive check is especially important for a "semi-hidden feature" where users may not follow the expected interface strictly.
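To illustrate why the guard matters, here is a minimal standalone sketch of the guarded handler logic. The TensorBoard/MLFlow calls are stubbed out with a plain dict so it runs anywhere; `collect_epoch_metrics` and the two model classes are hypothetical names, not part of the PR.

```python
def collect_epoch_metrics(model):
    """Sketch of the guarded handler logic; logging is stubbed out
    with a returned dict instead of TensorBoard/MLFlow calls."""
    logged = {}
    if hasattr(model, "log_epoch_metrics"):
        epoch_metrics = model.log_epoch_metrics()
        # Without this check, `for m in None:` would raise TypeError.
        if isinstance(epoch_metrics, dict) and epoch_metrics:
            for m in epoch_metrics:
                logged[f"training/epoch/{m}"] = epoch_metrics[m]
    return logged


class GoodModel:
    def log_epoch_metrics(self):
        return {"accuracy": 0.9}


class SloppyModel:
    def log_epoch_metrics(self):
        return None  # violates the expected interface


print(collect_epoch_metrics(GoodModel()))    # {'training/epoch/accuracy': 0.9}
print(collect_epoch_metrics(SloppyModel()))  # {} -- no crash
```

A model with no `log_epoch_metrics` at all also passes through harmlessly, since `hasattr` fails first.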
Suggested change:

```diff
-for m in epoch_metrics:
-    tensorboardx_logger.add_scalar(
-        f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
-    )
-    mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)
+if isinstance(epoch_metrics, dict) and epoch_metrics:
+    for m in epoch_metrics:
+        tensorboardx_logger.add_scalar(
+            f"training/training/epoch/{m}", epoch_metrics[m], global_step=epoch_number
+        )
+        mlflow.log_metrics({f"training/epoch/{m}": epoch_metrics[m]}, step=epoch_number)
```
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
- Coverage   54.94%   54.91%   -0.03%
==========================================
  Files          51       51
  Lines        5007     5015       +8
==========================================
+ Hits         2751     2754       +3
- Misses       2256     2261       +5
```

☔ View full report in Codecov by Sentry.
dougbrn left a comment:
This looks good for rapid turnaround on the applecider incubator. I think this should eventually evolve into something more tunable within `train_step`, so that `train_step` does not need modification on top of the definition of this special method.
This change adds a new event handler to the training engine that looks for a model method named `log_epoch_metrics`. If found, the method is called on the model at the end of each training epoch.

It is expected (though not enforced) that `log_epoch_metrics` will return a dictionary similar to what `train_step` returns. Each key of the dictionary is then logged under `training/training/epoch/<foo>` in TensorBoard and `training/epoch/<foo>` in MLFlow.

The anticipated use case is that model developers will accumulate values in `train_step` over the course of an epoch, then use those values to calculate a metric at the end of the epoch before resetting the accumulators.

Just a simple example after implementing `log_epoch_metrics` in HyraxAutoencoder that returns 3 "metrics" with random values over the course of training for 5 epochs:


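A minimal sketch of the kind of method used for the screenshot above: three "metrics" with random values, returned once per epoch. The `DemoAutoencoder` class is a hypothetical stand-in; the real `HyraxAutoencoder` implementation is not shown in this PR description.

```python
import random


class DemoAutoencoder:
    """Hypothetical stand-in for the HyraxAutoencoder example:
    returns three 'metrics' with random values each epoch."""

    def log_epoch_metrics(self):
        return {name: random.random() for name in ("metric_a", "metric_b", "metric_c")}


model = DemoAutoencoder()
for epoch in range(5):
    metrics = model.log_epoch_metrics()
    # The real handler would forward these to TensorBoard/MLFlow;
    # here we just print them.
    for m in metrics:
        print(f"epoch {epoch}: training/epoch/{m} = {metrics[m]:.3f}")
```

Each epoch produces one scalar per metric name, which is what yields the three curves over five steps seen in the TensorBoard screenshot.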