Collaborator

@fynnsu fynnsu commented Apr 25, 2025

Adds a general metric logger format with support for logging to a JSON file, wandb, tensorboard, and the existing async logger.

Users can select the backend with --logger_type:

  • async: uses the existing AsyncStructuredLogger. This is the default option, and the default file names are unchanged for backwards compatibility.
  • file: a basic JSONL file logger (essentially the same as async, but synchronous).
  • tensorboard: uses PyTorch's torch.utils.tensorboard.SummaryWriter to write logs in TensorBoard format.
  • wandb: logs to wandb. Currently untested (I don't have an account and need to set that up first).

Users can also specify --run_name as a string. Instances of {rank}, {local_rank}, and {time} in the string will be replaced with their respective values.

e.g. {time}_rank{rank} -> 2025-04-25T17:26:01.477437_rank0
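
The substitution described above could look roughly like the following. This is only a sketch, not the PR's implementation: the helper name format_run_name is hypothetical, and reading the ranks from the RANK and LOCAL_RANK environment variables is an assumption.

```python
import os
from datetime import datetime
from typing import Mapping, Optional


def format_run_name(
    template: str,
    now: Optional[datetime] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Replace {rank}, {local_rank}, and {time} in a run-name template.

    Hypothetical helper: the RANK / LOCAL_RANK variable names are assumed.
    """
    now = now or datetime.now()
    values = {
        "rank": env.get("RANK", "0"),
        "local_rank": env.get("LOCAL_RANK", "0"),
        "time": now.isoformat(),
    }
    # Only the known placeholders are replaced; any other braces are left alone.
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template
```

Using str.replace rather than str.format means a run name that contains unrelated braces does not raise a KeyError.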

@mergify mergify bot added the ci-failure label Apr 25, 2025
@fynnsu fynnsu force-pushed the general_logging branch 2 times, most recently from f431306 to 3834c21 Compare April 25, 2025 21:49
@mergify mergify bot added ci-failure and removed ci-failure labels Apr 25, 2025
Member

@RobotSail RobotSail left a comment

Thanks for the PR! A few things need to be fixed, but overall looks good!

Contributor

@booxter booxter left a comment

Main points:

  1. Consider using the logging module for file management (only use a custom Formatter for JSONL).
  2. Define and enforce the form of the input dicts ("a recursive dict of string values").

We will also need some unit tests to validate the new addition. Overall, this looks like a very good start. Kudos.
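
To illustrate the first point, here is a minimal sketch of a stdlib logging setup in which only the Formatter is custom. The names JsonlFormatter, make_metric_logger, and metrics_sketch are hypothetical and are not taken from the PR; an in-memory stream stands in for a real log file.

```python
import io
import json
import logging


class JsonlFormatter(logging.Formatter):
    """Render each record as a single JSON line (hypothetical formatter)."""

    def format(self, record: logging.LogRecord) -> str:
        # Dict messages are serialized as-is; anything else is wrapped.
        if isinstance(record.msg, dict):
            payload = record.msg
        else:
            payload = {"message": record.getMessage()}
        return json.dumps(payload, sort_keys=True)


def make_metric_logger(stream, name: str = "metrics_sketch") -> logging.Logger:
    """Attach a JSONL-formatting handler to a dedicated metrics logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep metric records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonlFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
metric_logger = make_metric_logger(buffer)
metric_logger.info({"step": 1, "loss": 0.5})
```

With a logging.FileHandler in place of the StreamHandler, file creation, flushing, and closing are all handled by the logging module itself.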


try:
    # Third Party
    import wandb
Contributor

Should it also be declared via an optional-dependencies group in pyproject.toml?

Collaborator Author

I'm not sure what is best here, so will defer to the team. However, my reasons for not including it are:

  1. We will potentially have support for 3-4 different logging libraries and would then need an optional dependency for each.
  2. Each of these dependency groups would contain a single package. It isn't necessarily easier or more logical to do pip install instructlab-training[wandb] than it is to do pip install instructlab-training wandb.
  3. It should be relatively clear to the user that they need to install wandb to use the WandbLogger, and if not, the error message should clarify that.
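
The third point, a clear error when an optional backend is missing, might be sketched as below. The helper require_optional is hypothetical and only shows the general pattern; the PR's actual error handling may differ.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message.

    Hypothetical helper, not part of instructlab-training.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc
```

A logger class could call this in its constructor, so the import error only surfaces when that backend is actually selected.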

Contributor

I've seen packages with dozens of these one-entry dependencies. :) What it gives you is being able to request particular versions of libraries if needed.

Member

Either way is fine. For PyTorch you still have to install tensorboard separately even though it's technically part of the API. So I don't know if we need a one-off optional requirement just for wandb, but if there are other packages that we need to install to make it work then it could make sense.

"""Create and initialize a logger of the specified type.
Args:
    logger_type: Type of logger to create (must be one of ["file", "wandb", "tensorboard", "async"])
Contributor

I suggest constructing the possible options programmatically to avoid drift. That said, you can probably enforce "one of" semantics with a type hint using an enum constructed from the allowed options.
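
A sketch of that suggestion, assuming the four backends named in the PR description. The names LoggerType, describe_logger_types, and parse_logger_type are hypothetical.

```python
from enum import Enum


class LoggerType(str, Enum):
    """Allowed logger backends (hypothetical enum mirroring the PR description)."""

    ASYNC = "async"
    FILE = "file"
    TENSORBOARD = "tensorboard"
    WANDB = "wandb"


def describe_logger_types() -> str:
    """Build the 'one of' text from the enum so docs and errors cannot drift."""
    return ", ".join(f'"{member.value}"' for member in LoggerType)


def parse_logger_type(value: str) -> LoggerType:
    """Convert a string to a LoggerType, listing the valid options on failure."""
    try:
        return LoggerType(value)
    except ValueError:
        raise ValueError(
            f"logger_type must be one of [{describe_logger_types()}], got {value!r}"
        ) from None
```

Because the enum is the single source of truth, adding a backend updates the type hint, the error message, and any generated help text at once.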

)
parser.add_argument("--log_level", type=str, default="INFO")
parser.add_argument("--run_name", type=str, default=None)
parser.add_argument("--logger_type", type=str, default="async")
Contributor

Member

In practice it's better to avoid this, though. I find choices to be super clunky and not as easy to work with.

Contributor

Can you please elaborate? As long as choices are calculated (from an enum or dict keys), one doesn't need to touch the argument at all.

Contributor

I personally agree with Ihar: if we maintained a mapping of available logger types in instructlab.training.logger, we could generate the list of available logging backends dynamically. That's self-documenting and easier for a user to inspect via the --help message.
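
As a rough illustration of generating the choices from a mapping: the LOGGER_BACKENDS registry, its descriptions, and the parser below are hypothetical and do not reflect the actual contents of instructlab.training.logger.

```python
import argparse

# Hypothetical registry: backend name -> short description.
LOGGER_BACKENDS = {
    "async": "existing AsyncStructuredLogger",
    "file": "synchronous JSONL file logger",
    "tensorboard": "torch.utils.tensorboard.SummaryWriter",
    "wandb": "Weights & Biases",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --logger_type choices come from the registry."""
    parser = argparse.ArgumentParser(prog="train_sketch")
    parser.add_argument(
        "--logger_type",
        type=str,
        default="async",
        choices=sorted(LOGGER_BACKENDS),  # derived, so it never needs manual edits
        help="metric logging backend",
    )
    return parser
```

Running such a parser with --help lists the backends in the usage line, and an unknown value is rejected by argparse before any training code runs.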

Contributor

(Granted, having argparse as an interface to the library at all is not optimal, and we should probably get rid of it and switch to proper function arguments.)

Member

@RobotSail RobotSail left a comment

Thank you for making the requested changes and checking functionality of TensorBoard + Wandb. LGTM!

Contributor

booxter commented Apr 28, 2025

One other scenario that may be useful is being able to use multiple log destinations at the same time. This would allow collecting the same metrics in multiple formats and using the preferred tool for each analysis. This is also where integration with the Python logging module could be helpful, since it allows defining multiple destinations for the same messages through propagate.

I'm thinking of enabling all these loggers in CI training runs and collecting all of the outputs as GitHub artifacts.

@mergify mergify bot added the one-approval label Apr 28, 2025
Collaborator Author

fynnsu commented Apr 28, 2025

> One other scenario that may be useful is being able to use multiple log destinations at the same time. This would allow collecting the same metrics in multiple formats and using the preferred tool for each analysis. This is also where integration with the Python logging module could be helpful, since it allows defining multiple destinations for the same messages through propagate.
>
> I'm thinking of enabling all these loggers in CI training runs and collecting all of the outputs as GitHub artifacts.

Yeah, that's something I've been thinking about. It would also be very easy to implement a "MultiLogger" class that simply loops through its nested loggers. I will look into the logging module more and see whether it would work well with wandb/tensorboard. I do think it is useful to keep the "metric logger" separate from the regular run logging, so I would want to make sure that's possible while using the logging module for regular logs.
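
The "MultiLogger" idea could be sketched as follows. The log(metrics, step) interface and the ListLogger stand-in are assumptions made for illustration; the PR's logger classes may expose a different API.

```python
from typing import Any, Dict, Iterable, List, Protocol, Tuple


class MetricLogger(Protocol):
    """Assumed minimal interface shared by all backends."""

    def log(self, metrics: Dict[str, Any], step: int) -> None: ...


class ListLogger:
    """In-memory stand-in for a real backend (file, wandb, tensorboard)."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, Dict[str, Any]]] = []

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        self.records.append((step, dict(metrics)))


class MultiLogger:
    """Fan each call out to every nested logger, in order."""

    def __init__(self, loggers: Iterable[MetricLogger]) -> None:
        self.loggers = list(loggers)

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        for logger in self.loggers:
            logger.log(metrics, step)
```

An empty MultiLogger is a valid no-op, which makes "no metric logging" just another configuration rather than a special case.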

@RobotSail
Member

@fynnsu Yes, please do that if you can. I would keep the implementation simple (re-use the code you've already written) and just make it use the existing AsyncStructuredLogger by default, so that everything else can be added on top. This way we still retain log data even when using TensorBoard or anything else.

@mergify mergify bot added dependencies Pull requests that update a dependency file and removed ci-failure labels Apr 30, 2025
@fynnsu fynnsu force-pushed the general_logging branch 4 times, most recently from 777a5eb to 908cb5e Compare May 2, 2025 13:38
@mergify mergify bot added the ci-failure label May 2, 2025
Contributor

mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @fynnsu please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
)
```
"""
if not loggers:
Contributor

Disagree on raising, since it's a valid input. What it should probably mean (if it doesn't already) is that all previously set loggers should be disabled.
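
One way to get the semantics proposed here with stdlib logging is sketched below. The function configure_metric_handlers is hypothetical and is not the PR's setup function. The NullHandler is added because a logger with no handlers at all would still send WARNING and higher records to logging's last-resort stderr handler.

```python
import logging
from typing import Sequence


def configure_metric_handlers(
    logger: logging.Logger, handlers: Sequence[logging.Handler]
) -> None:
    """Replace the logger's handlers; an empty sequence disables output."""
    # Drop whatever was configured by an earlier call.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    for handler in handlers:
        logger.addHandler(handler)
    if not handlers:
        # Swallow records silently instead of falling back to stderr.
        logger.addHandler(logging.NullHandler())
    logger.propagate = False
```

Calling it a second time with an empty list therefore turns off everything that the first call enabled.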

Contributor

@booxter booxter left a comment

Overall I think this is good to go. Some extra care with test env cleanup is advised (restore env vars; clean up loggers), as is a usage document, as requested by James (and by me before, though perhaps I wasn't explicit about what I was asking for). Otherwise I'm ready to merge this in.
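
The cleanup advice (restore env vars; clean up loggers) might be sketched with a small context manager like the one below. The name clean_logging_env is hypothetical; in a pytest suite, the built-in monkeypatch fixture could handle the environment part instead.

```python
import contextlib
import logging
import os


@contextlib.contextmanager
def clean_logging_env(logger_name: str, **env_overrides: str):
    """Apply env overrides for a test, then restore env vars and handlers."""
    logger = logging.getLogger(logger_name)
    saved_env = {key: os.environ.get(key) for key in env_overrides}
    saved_handlers = list(logger.handlers)
    os.environ.update(env_overrides)
    try:
        yield logger
    finally:
        # Restore each variable to its previous value, or unset it.
        for key, old_value in saved_env.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
        # Remove and close only the handlers added during the test.
        for handler in list(logger.handlers):
            if handler not in saved_handlers:
                logger.removeHandler(handler)
                handler.close()
```

Because the restore logic sits in a finally block, it also runs when the test body raises.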

@fynnsu fynnsu force-pushed the general_logging branch from d0b3e37 to 8713cd4 Compare May 8, 2025 20:49
@mergify mergify bot added the ci-failure label May 8, 2025
Signed-off-by: Fynn Schmitt-Ulms <[email protected]>
@mergify mergify bot removed the ci-failure label May 8, 2025
@mergify mergify bot added the ci-failure label May 12, 2025
@fynnsu fynnsu force-pushed the general_logging branch 2 times, most recently from 5f47d0a to 55a9275 Compare May 12, 2025 14:46
@mergify mergify bot removed the ci-failure label May 12, 2025
Collaborator Author

fynnsu commented May 12, 2025

@booxter @JamesKunstle I've added a docs/logging.md file that describes both stdlib logging and its integration into instructlab.training. Let me know if anything is unclear or needs further explanation.

Contributor

@booxter booxter left a comment

Thank you for pulling this off and addressing all the comments (sometimes misleading!). I like how this functionality integrates with the stdlib logging approach. We should strive to be pythonic.

python src/instructlab/training/main_ds.py \
... \
--run_name "my_run" \
--logger_type "async,tensorboard,wandb" \
Contributor

love how easy it is to do multi-backend, or implement another backend. ❤️
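
For readers wondering how a comma-separated value such as "async,tensorboard,wandb" might be turned into a list of backends, here is one possible sketch. The helper parse_logger_types and its default allowed tuple are hypothetical; how the merged code parses the flag is not shown in this thread.

```python
from typing import List, Sequence


def parse_logger_types(
    value: str,
    allowed: Sequence[str] = ("async", "file", "tensorboard", "wandb"),
) -> List[str]:
    """Split a comma-separated --logger_type value into validated names."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError(
            f"unknown logger type(s) {unknown}; expected any of {list(allowed)}"
        )
    # Drop duplicates while keeping the order the user wrote.
    return list(dict.fromkeys(names))
```

An empty string yields an empty list, which matches the earlier review comment that "no loggers" is a valid input rather than an error.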

@mergify mergify bot removed the one-approval label May 12, 2025
@booxter booxter requested a review from JamesKunstle May 12, 2025 14:53
Signed-off-by: Fynn Schmitt-Ulms <[email protected]>
@fynnsu fynnsu force-pushed the general_logging branch from 55a9275 to e283c9a Compare May 12, 2025 14:57
@mergify mergify bot added ci-failure and removed ci-failure labels May 12, 2025
Contributor

@JamesKunstle JamesKunstle left a comment

awesome, very excited to have this!

@JamesKunstle JamesKunstle merged commit 7682500 into instructlab:main May 12, 2025
16 checks passed
@fynnsu fynnsu deleted the general_logging branch May 12, 2025 22:11
Labels

CI/CD, dependencies, testing