Trend Anomalies in Ruler V2 #1143

@zhejunliux

When using GPT-4.1 to evaluate the 256k and 512k configurations of Ruler V2, I observed two cases where a harder difficulty level scores higher than an easier one:

  1. mv_niah: Medium scores higher than Easy (256k: 67.72 vs 64.67; 512k: 62.71 vs 50.57).
  2. qa: Hard scores higher than Medium (256k: 84 vs 81; 512k: 83 vs 79).

Score trend by difficulty level:

256k

mk_niah
|████ Basic (100%) |████ Easy (94.83%) |███ Medium (90%) |██ Hard (84%)

mv_niah
|████ Basic (84.25%) |███ Easy (64.67%) |███ Medium (67.72%) |█ Hard (45.12%)

qa
|████ Basic (100%) |████ Easy (88%) |███ Medium (81%) |██ Hard (84%)

512k

mk_niah
|████ Basic (98%) |████ Easy (90.44%) |███ Medium (85%) |██ Hard (82%)

mv_niah
|███ Basic (74.25%) |█ Easy (50.57%) |██ Medium (62.71%) |▌ Hard (43.87%)

qa
|████ Basic (97%) |███ Easy (81%) |██ Medium (79%) |██ Hard (83%)
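
To make the anomalies easy to reproduce, here is a minimal Python sketch that scans the scores reported above for places where a harder level outscores the adjacent easier one. The names `LEVELS`, `SCORES`, and `find_inversions` are mine for illustration and are not part of nemo_skills; the numbers are copied from the charts in this issue.

```python
# Scores (percent) copied from this issue, ordered Basic, Easy, Medium, Hard.
LEVELS = ["Basic", "Easy", "Medium", "Hard"]
SCORES = {
    "256k": {
        "mk_niah": [100.0, 94.83, 90.0, 84.0],
        "mv_niah": [84.25, 64.67, 67.72, 45.12],
        "qa": [100.0, 88.0, 81.0, 84.0],
    },
    "512k": {
        "mk_niah": [98.0, 90.44, 85.0, 82.0],
        "mv_niah": [74.25, 50.57, 62.71, 43.87],
        "qa": [97.0, 81.0, 79.0, 83.0],
    },
}


def find_inversions(scores):
    """Return (context, task, easier, harder, delta) for every adjacent
    pair of levels where the harder level has the higher score."""
    inversions = []
    for context, tasks in scores.items():
        for task, values in tasks.items():
            for i in range(len(values) - 1):
                delta = values[i + 1] - values[i]
                if delta > 0:
                    inversions.append(
                        (context, task, LEVELS[i], LEVELS[i + 1], round(delta, 2))
                    )
    return inversions


if __name__ == "__main__":
    for context, task, easier, harder, delta in find_inversions(SCORES):
        print(f"{context} {task}: {harder} > {easier} by {delta} points")
```

Only mv_niah (Easy to Medium) and qa (Medium to Hard) are flagged, at both context lengths; mk_niah decreases monotonically.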

Hello @hsiehjackson, are there anomalies in the data generation strategy or the evaluation metrics in nemo_skills/dataset/ruler2/prepare.py?

How to close

  1. How can this issue be fixed to restore normal trends?
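
Before changing anything in prepare.py, it may be worth checking whether the gaps exceed sampling noise. The sketch below is a rough estimate only: it treats each score as a plain accuracy over independent samples (which may not hold for mv_niah if it awards partial credit per sample), and the sample count `n` is a placeholder because I do not know the actual number of samples per level. `gap_z_score` is a hypothetical helper, not a nemo_skills function.

```python
import math


def gap_z_score(p_easier, p_harder, n_easier, n_harder):
    """Z-score of (harder - easier) under a two-proportion normal approximation.

    Scores are fractions in [0, 1]; n_* are the sample counts per level.
    """
    variance = (
        p_easier * (1.0 - p_easier) / n_easier
        + p_harder * (1.0 - p_harder) / n_harder
    )
    if variance == 0.0:
        raise ValueError("both scores are 0 or 1; the approximation is undefined")
    return (p_harder - p_easier) / math.sqrt(variance)


if __name__ == "__main__":
    n = 100  # ASSUMPTION: placeholder sample count, replace with the real value
    gaps = {
        "256k qa Medium->Hard": (0.81, 0.84),
        "512k qa Medium->Hard": (0.79, 0.83),
        "256k mv_niah Easy->Medium": (0.6467, 0.6772),
        "512k mv_niah Easy->Medium": (0.5057, 0.6271),
    }
    for name, (easier, harder) in gaps.items():
        z = gap_z_score(easier, harder, n, n)
        print(f"{name}: z = {z:.2f} (|z| > 1.96 suggests more than noise)")
```

If the z-scores stay small at the real sample counts, the inversions could be explained by variance alone, and running more samples per level would be a cheaper first step than changing the generation strategy.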
