Content
When using GPT-4.1 to evaluate data generated with the 256k and 512k configurations of RULER V2, the following two anomalies in the difficulty trend were observed:
- mv_niah Medium scores higher than Easy: 67.72 vs 64.67 (256k) and 62.71 vs 50.57 (512k).
- qa Hard scores higher than Medium: 84 vs 81 (256k) and 83 vs 79 (512k).
Trend

| Config | Task | Basic | Easy | Medium | Hard |
|--------|---------|---------|--------|--------|--------|
| 256k | mk_niah | 100% | 94.83% | 90% | 84% |
| 256k | mv_niah | 84.25% | 64.67% | 67.72% | 45.12% |
| 256k | qa | 100% | 88% | 81% | 84% |
| 512k | mk_niah | 98% | 90.44% | 85% | 82% |
| 512k | mv_niah | 74.25% | 50.57% | 62.71% | 43.87% |
| 512k | qa | 97% | 81% | 79% | 83% |
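For reference, here is a minimal sanity check over the numbers above (scores are hardcoded from the table; the assumed invariant is that scores should be non-increasing from Basic to Hard). It flags exactly the four violations reported:

```python
# Sanity check: within each config/task, scores should be
# non-increasing as difficulty goes Basic -> Easy -> Medium -> Hard.
LEVELS = ["Basic", "Easy", "Medium", "Hard"]

SCORES = {
    ("256k", "mk_niah"): [100.0, 94.83, 90.0, 84.0],
    ("256k", "mv_niah"): [84.25, 64.67, 67.72, 45.12],
    ("256k", "qa"):      [100.0, 88.0, 81.0, 84.0],
    ("512k", "mk_niah"): [98.0, 90.44, 85.0, 82.0],
    ("512k", "mv_niah"): [74.25, 50.57, 62.71, 43.87],
    ("512k", "qa"):      [97.0, 81.0, 79.0, 83.0],
}

for (config, task), scores in SCORES.items():
    for i in range(1, len(scores)):
        if scores[i] > scores[i - 1]:
            print(f"{config}/{task}: {LEVELS[i]} ({scores[i]}) > "
                  f"{LEVELS[i - 1]} ({scores[i - 1]})")
```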
Hello @hsiehjackson, are there anomalies in the data generation strategy or in the evaluation metrics in nemo_skills/dataset/ruler2/prepare.py that could explain this?
How to close
- How can this issue be fixed so that scores follow the expected easy-to-hard trend? A possible starting point for debugging is sketched below.
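One way to narrow this down (a sketch only; the output directory, file names, and the `expected_answers` field are assumptions, not the actual schema written by prepare.py) is to inspect the generated subsets and confirm that the intended difficulty knobs, e.g. how many values each mv_niah sample asks for, actually increase from Easy to Hard:

```python
# Hypothetical sketch: check that generated subsets differ in the
# expected difficulty knobs. Paths and field names are guesses;
# adapt them to whatever prepare.py actually writes.
import json
from pathlib import Path

DATA_DIR = Path("data/ruler2/256k")  # assumed output location

for subset in ["mv_niah_easy", "mv_niah_medium", "mv_niah_hard"]:
    path = DATA_DIR / f"{subset}.jsonl"
    if not path.exists():
        print(f"{subset}: file not found, skipping")
        continue
    with path.open() as f:
        samples = [json.loads(line) for line in f]
    # For mv_niah, harder subsets should require retrieving more
    # values per question; a flat or inverted trend here would point
    # at the generation strategy rather than the model.
    counts = [len(s.get("expected_answers", [])) for s in samples]
    avg = sum(counts) / len(counts) if counts else 0.0
    print(f"{subset}: {len(samples)} samples, "
          f"avg expected answers = {avg:.2f}")
```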