New metric: ClassificationReport #3116

Open. Wants to merge 40 commits into base: master.

Conversation

aymuos15 commented Jun 3, 2025

What does this PR do?

Fixes #2580

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements) - Yes, details are in the issue referenced above.
  • Did you read the contributor guideline, Pull Request section? - Yes.
  • Did you make sure to update the docs? - Not yet; I can do this once the code is finalised.
  • Did you write any new necessary tests? - Yes.
PR review

A classification report is a nice-to-have metric. The motivation can be drawn directly from its scikit-learn counterpart. Having it ready for tensors is definitely a good step forward.
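
For reference, this is the scikit-learn behaviour being mirrored here (standard scikit-learn API, shown only to illustrate the target output format; the sample values match the binary example further down this thread):

    from sklearn.metrics import classification_report

    y_true = [0, 1, 0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 0, 1, 0, 1, 1]
    # Prints a per-class table with precision, recall, f1-score and support,
    # plus accuracy / macro avg / weighted avg summary rows.
    print(classification_report(y_true, y_pred))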

Notes:

  1. The original issue talks about having the option to include more metrics. I personally feel that is not required, as this should be a quick, standalone way to get the most relevant (and usually required) metric info in a presentable format.
  2. Maybe we need a more in-depth discussion to expand the testing. I do not have access to a distributed setup at the moment, but if it is paramount, I can look into it.

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Was a really nice way of getting familiar with the codebase :)


📚 Documentation preview 📚: https://torchmetrics--3116.org.readthedocs.build/en/3116/

Borda (Member) commented Jun 4, 2025

Great addition!
Please check the failures due to a circular import.

aymuos15 and others added 5 commits June 6, 2025 11:46
- Remove direct imports of classification metrics at module level
- Implement lazy imports using @property for Binary/Multiclass/MultilabelClassificationReport
- Move metric initialization to property getters to break circular dependencies
- Maintain all existing functionality while improving import structure
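
For context, a minimal sketch of the lazy-import pattern these commit notes describe (illustrative only; the exact class layout in the PR may differ):

    class BinaryClassificationReport:
        """Illustrative stand-in for the PR's metric class, showing only the lazy import."""

        def __init__(self, threshold: float = 0.5) -> None:
            self.threshold = threshold
            self._precision = None  # built lazily to break the circular import

        @property
        def precision_metric(self):
            # Importing inside the property defers loading torchmetrics.classification
            # until the sub-metric is first accessed, so importing this module no longer
            # triggers the circular dependency at module load time.
            if self._precision is None:
                from torchmetrics.classification import BinaryPrecision

                self._precision = BinaryPrecision(threshold=self.threshold)
            return self._precision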
aymuos15 (Author) commented Jun 6, 2025

Thank you very much! @Borda

Could you please help me understand why the TM.unittests are failing? They seem to be linked to Azure, and I am not familiar with the error logs there.

Regarding the ruff failures, my local checks pass for some reason, so I'll look into that. But is it okay if I address those once the actual code is vetted?

Sorry for the delay in getting back to you; I had some urgent personal work to get done.

CC: @rittik9 I think it is ready now?

rittik9 (Contributor) left a comment

@aymuos15 I had a quick look and opened a PR to fix the pre-commit errors.
The doctests for the classification report seem to be failing; mind having a look?
I am going to take a more detailed look later.

aymuos15 (Author) commented Jun 6, 2025

Thanks a lot for the quick fix, @rittik9.

Will look into the doctests now.

Sorry about the close and reopen; I think the close was automatically triggered after the local merge for some reason. Not sure.

rittik9 (Contributor) left a comment

Hi @aymuos15, please update the __init__.py files in the functional interface and the docs.
You can refer to #3090.

Review comment (Contributor) on multiclass_classification_report:

can we also provide top_k for multiclass?
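
For context, a hedged illustration of how top_k already works for existing multiclass metrics (standard torchmetrics functional API; not code from this PR):

    import torch
    from torchmetrics.functional.classification import multiclass_accuracy

    preds = torch.randn(8, 3).softmax(dim=-1)  # class probabilities for 3 classes
    target = torch.randint(0, 3, (8,))
    # A prediction counts as correct if the target is among the k highest-scoring classes.
    multiclass_accuracy(preds, target, num_classes=3, top_k=2)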

rittik9 (Contributor) left a comment

Please remove the unnecessary comments.

aymuos15 (Author):

Thank you very much for the review, @rittik9.

I think most of the suggestions need significant changes to implement. I will get this done over the weekend and will make sure to remove the unnecessary comments as well.

Review comment (Contributor):

For the modular interface, why not call the functions from the functional interface when computing? That would reduce boilerplate.
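
One way to read this suggestion, as a hedged sketch (binary_classification_report is assumed to be the functional counterpart added in this PR; names may not match the final code):

    import torch
    from torchmetrics import Metric
    from torchmetrics.utilities.data import dim_zero_cat


    class BinaryClassificationReport(Metric):
        """Sketch: accumulate predictions, delegate the report to the functional API."""

        def __init__(self, threshold: float = 0.5, **kwargs) -> None:
            super().__init__(**kwargs)
            self.threshold = threshold
            self.add_state("preds", default=[], dist_reduce_fx="cat")
            self.add_state("target", default=[], dist_reduce_fx="cat")

        def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
            self.preds.append(preds)
            self.target.append(target)

        def compute(self):
            # Reuse the functional implementation instead of re-deriving the report
            # here; that is the boilerplate reduction being suggested.
            from torchmetrics.functional.classification import binary_classification_report

            return binary_classification_report(
                dim_zero_cat(self.preds), dim_zero_cat(self.target), threshold=self.threshold
            )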

rittik9 (Contributor) commented Jun 12, 2025

Hi @aymuos15, overall it looks pretty good to me; nice work. I just added a few comments. You can also update the torchmetrics/src/torchmetrics/functional/__init__.py file.

aymuos15 and others added 3 commits June 17, 2025 14:23
…r accuracy | Revamp testing to make it more parametrised | Make the non-functional part of the metric more modular | Add ignore_index support | Add top_k support for multiclass
aymuos15 (Author):

@rittik9 Thanks again for the detailed review.

I have tried to address everything bar the classwise accuracy score, which is inherently different from the scikit-learn version of the metric. Are you okay with raising a separate issue for that, discussing how exactly we want to add it, and then doing it in a separate PR?

Borda (Member) left a comment

Overall it looks good. I just think it would be better to have it more as a wrapper, so you can select which metrics you want in the report (probably by passing them as string metric/class names), with precision, recall and F-measure as the default configuration.

see suggestion by @rittik9 in #2580 (comment)
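
As a hedged sketch of that wrapper idea (binary case only; the registry and builder function are hypothetical and not the PR's API):

    from torchmetrics import MetricCollection
    from torchmetrics.classification import (
        BinaryF1Score,
        BinaryPrecision,
        BinaryRecall,
        BinarySpecificity,
    )

    # Hypothetical registry: report columns selectable by string name.
    _AVAILABLE = {
        "precision": BinaryPrecision,
        "recall": BinaryRecall,
        "f1-score": BinaryF1Score,
        "specificity": BinarySpecificity,
    }


    def build_report_metrics(metrics=("precision", "recall", "f1-score")):
        # The default selection mirrors scikit-learn: precision, recall, F-measure.
        return MetricCollection({name: _AVAILABLE[name]() for name in metrics})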

aymuos15 (Author):

Thank you very much for the review.

I definitely like the idea and agree with it.

However, I think I would have to pretty much revamp everything. Would it be okay to keep this as the base and proceed with that separately in another PR or should I just continue here?

Borda (Member) commented Jun 18, 2025

> However, I think I would have to pretty much revamp everything. Would it be okay to keep this as the base and proceed with that separately in another PR or should I just continue here?

you can continue here 🐰

aymuos15 (Author) commented Jun 18, 2025

Haha, alright. I'll take some time to understand the best way to approach this and to write the tests so that edge cases do not come up. Thanks again!

rittik9 (Contributor) commented Jun 18, 2025

Hi @aymuos15 While you're at it, could you please ensure that the 'micro' averaging option is included by default for all three cases—binary, multiclass, and multilabel? In case the results don't align with sklearn, we can always verify them manually.
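
For the manual verification, a minimal cross-check against scikit-learn could look like this sketch (multiclass shown, using the example values from later in this thread; binary and multilabel are analogous):

    import torch
    from sklearn.metrics import precision_score
    from torchmetrics.functional.classification import multiclass_precision

    preds = torch.tensor([0, 2, 1, 2, 0, 1, 2, 0])
    target = torch.tensor([0, 1, 1, 2, 0, 2, 2, 1])

    # Both should yield the same micro-averaged precision (here 5/8 = 0.625).
    tm_micro = multiclass_precision(preds, target, num_classes=3, average="micro")
    sk_micro = precision_score(target.numpy(), preds.numpy(), average="micro")
    assert torch.isclose(tm_micro, torch.tensor(float(sk_micro)))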

aymuos15 (Author):

Yup, sure. Thank you for reminding me.

Since we are now going to include everything anyway, I'll make all the options exhaustive.

Borda (Member) commented Jun 23, 2025

> Since we are now going to include everything anyway, I'll make all the options exhaustive.

We can figure out how to split it into smaller pieces so it lands more smoothly and faster.

aymuos15 (Author):

That sounds good to me. Could you please let me know what that would entail?

This is what I had in mind:

TorchMetrics Classification Report Examples

============================================================
 BINARY CLASSIFICATION
============================================================
Predictions: [0, 1, 1, 0, 1, 0, 1, 1]
True Labels: [0, 1, 0, 0, 1, 1, 1, 0]

--- CASE 1: DEFAULT METRICS (precision, recall, f1-score) ---
             precision    recall  f1-score   support 

           0      0.67      0.50      0.57         4 
           1      0.60      0.75      0.67         4 

    accuracy                          0.62         8 
   micro avg      0.62      0.62      0.62         8 
   macro avg      0.63      0.62      0.62         8 
weighted avg      0.63      0.62      0.62         8 

--- CASE 2: ADD ACCURACY TO DEFAULT METRICS ---
             precision    recall  f1-score  accuracy   support 

           0      0.67      0.50      0.57      0.50         4 
           1      0.60      0.75      0.67      0.75         4 

    accuracy                                    0.62         8 
   micro avg      0.62      0.62      0.62      0.62         8 
   macro avg      0.63      0.62      0.62      0.62         8 
weighted avg      0.63      0.62      0.62      0.62         8 

--- CASE 3: SPECIFICITY ONLY ---
             specificity     support 

           0        0.75           4 
           1        0.50           4 

    accuracy        0.62           8 
   micro avg        0.50           8 
   macro avg        0.62           8 
weighted avg        0.62           8 

Verification - Direct accuracy calculation: 0.6250

============================================================
 MULTICLASS CLASSIFICATION
============================================================
Predictions: [0, 2, 1, 2, 0, 1, 2, 0]
True Labels: [0, 1, 1, 2, 0, 2, 2, 1]

--- CASE 1: DEFAULT METRICS ---
             precision    recall  f1-score   support 

           0      0.67      1.00      0.80         2 
           1      0.50      0.33      0.40         3 
           2      0.67      0.67      0.67         3 

    accuracy                          0.62         8 
   micro avg      0.62      0.62      0.62         8 
   macro avg      0.61      0.67      0.62         8 
weighted avg      0.60      0.63      0.60         8 

--- CASE 2: ADD ACCURACY TO DEFAULT METRICS ---
             precision    recall  f1-score  accuracy   support 

           0      0.67      1.00      0.80      1.00         2 
           1      0.50      0.33      0.40      0.33         3 
           2      0.67      0.67      0.67      0.67         3 

    accuracy                                    0.62         8 
   micro avg      0.62      0.62      0.62      0.62         8 
   macro avg      0.61      0.67      0.62      0.67         8 
weighted avg      0.60      0.63      0.60      0.63         8 

--- CASE 3: SPECIFICITY ONLY ---
             specificity     support 

           0        0.83           2 
           1        0.80           3 
           2        0.80           3 

    accuracy        0.62           8 
   micro avg        0.81           8 
   macro avg        0.81           8 
weighted avg        0.81           8 

Verification - Direct accuracy calculation: 0.6667

============================================================
 MULTILABEL CLASSIFICATION
============================================================
Predictions:
  Sample 1: [1, 0, 1]
  Sample 2: [0, 1, 0]
  Sample 3: [1, 1, 0]
  Sample 4: [0, 0, 1]
True Labels:
  Sample 1: [1, 0, 0]
  Sample 2: [0, 1, 1]
  Sample 3: [1, 1, 0]
  Sample 4: [0, 0, 1]

--- CASE 1: DEFAULT METRICS ---
             precision    recall  f1-score   support 

           0      1.00      1.00      1.00         2 
           1      1.00      1.00      1.00         2 
           2      0.50      0.50      0.50         2 

   micro avg      0.83      0.83      0.83         6 
   macro avg      0.83      0.83      0.83         6 
weighted avg      0.83      0.83      0.83         6 
 samples avg      0.88      0.88      0.83         6 

--- CASE 2: ADD ACCURACY TO DEFAULT METRICS ---
             precision    recall  f1-score  accuracy   support 

           0      1.00      1.00      1.00      1.00         2 
           1      1.00      1.00      1.00      1.00         2 
           2      0.50      0.50      0.50      0.50         2 

   micro avg      0.83      0.83      0.83      0.83         6 
   macro avg      0.83      0.83      0.83      0.83         6 
weighted avg      0.83      0.83      0.83      0.83         6 
 samples avg      0.88      0.88      0.83      0.83         6 

--- CASE 3: SPECIFICITY ONLY ---
             specificity     support 

           0        1.00           2 
           1        1.00           2 
           2        0.50           2 

   micro avg        0.83           6 
   macro avg        0.83           6 
weighted avg        0.83           6 
 samples avg                       6 

Verification - Direct accuracy calculation: 0.8333

Essentially:

  1. If anyone uses it off the shelf, it behaves exactly like scikit-learn.
  2. But we also have the option to add whatever has been discussed so far.

Is the above okay?

I have not committed anything yet because the code is very messy. Once we agree on a path forward, I will push the next commit, if that's okay.
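
A hedged sketch of how the proposed interface could be called to produce the binary report above (the function name follows this PR and the metrics keyword follows this proposal; neither is part of the released torchmetrics API):

    import torch
    from torchmetrics.functional.classification import binary_classification_report

    preds = torch.tensor([0, 1, 1, 0, 1, 0, 1, 1])
    target = torch.tensor([0, 1, 0, 0, 1, 1, 1, 0])

    # Case 1: off the shelf, scikit-learn-like defaults (precision, recall, f1-score).
    print(binary_classification_report(preds, target))

    # Cases 2 and 3: opt into additional or alternative metrics by name (proposed option).
    print(binary_classification_report(preds, target, metrics=["precision", "recall", "f1-score", "accuracy"]))
    print(binary_classification_report(preds, target, metrics=["specificity"]))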

Borda (Member) commented Jun 24, 2025

I rather had in mind that we trim this PR down to the printing functions and table formatting, and in a following PR add the extension of the collection metrics, either as a new subclass or a new method...

aymuos15 (Author):

Ah okay! I will push a commit for the formatting and printing tonight. Thank you.

aymuos15 (Author):

@Borda

Keeping in mind the rigid rule below for when the micro avg is shown:

    # Determine if micro average should be shown in the report based on classification task
    # Following scikit-learn's logic:
    # - Show for multilabel classification (always)
    # - Show for multiclass when using a subset of classes
    # - Don't show for binary classification (micro avg is same as accuracy)
    # - Don't show for full multiclass classification with all classes (micro avg is same as accuracy)
    show_micro_avg = False
    is_multilabel = task == ClassificationTask.MULTILABEL
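
Spelled out, that rule might look like the following sketch (a hypothetical helper; ClassificationTask is assumed to come from torchmetrics.utilities.enums):

    from torchmetrics.utilities.enums import ClassificationTask


    def _show_micro_avg(task, num_classes, labels=None) -> bool:
        if task == ClassificationTask.MULTILABEL:
            return True  # always shown for multilabel
        if task == ClassificationTask.MULTICLASS:
            # Shown only when the report is restricted to a subset of classes;
            # otherwise the micro avg would just duplicate accuracy.
            return labels is not None and len(labels) < num_classes
        return False  # binary: micro avg equals accuracy, so it is omitted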

After going through everything again, I think the PR as it stands does exactly the bare minimum. The main additions after the initial few commits were ignore_index and the micro avg. So just a few quick questions:

  1. Should I remove those as well and keep them for the follow-up PR?
  2. Do I keep the base of the collection metrics (it is not implemented in the current commit) or leave that out as well?
  3. The test file has multiple tests covering all the bases; do those need to change?

Thank you very much!

Borda (Member) commented Jun 25, 2025

Still, the module-like metric is not derived from the collection to save some computations.
I would just keep the functional part, as the module-like part needs to be redone.

aymuos15 (Author):

Okay, thank you. So just to confirm: I would only push src/torchmetrics/classification/classification_report.py and its corresponding files and tests?

mergify bot removed the "has conflicts" label Jun 30, 2025
Labels: documentation (Improvements or additions to documentation), topic: Classif

Successfully merging this pull request may close these issues.

Classification Report, maybe with the option to select even more metrics