evaluation/summary.py only aggregates the first split in each MTEB result file #187

Description

@lnxtree

Hi Qwen team, thanks for releasing the Qwen3 Embedding evaluation scripts.

I noticed that evaluation/summary.py appears to aggregate only the first split found in each MTEB result JSON file, while the official MTEB leaderboard/report aggregation uses the complete result structure.

In the current script:

eval_split = list(result['scores'].keys())[0]
score = sum([ele['main_score'] for ele in result['scores'][eval_split]]) / len(result['scores'][eval_split])

This makes the reported score depend on JSON key order and ignores any remaining splits under result['scores']. For tasks whose result files contain more than one split, the local summary can therefore diverge from the official MTEB aggregation.
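To illustrate the key-order dependence, here is a minimal sketch using a toy result dict. The split names and scores are made up for illustration and are not taken from any real MTEB result file; only the two-line selection logic is copied from summary.py.

```python
def first_split_score(result):
    # Same logic as the snippet quoted above: only the first key is used.
    eval_split = list(result['scores'].keys())[0]
    entries = result['scores'][eval_split]
    return sum(ele['main_score'] for ele in entries) / len(entries)


# Two hypothetical result files with identical content but different key order.
result_a = {'scores': {'test': [{'main_score': 0.80}],
                       'validation': [{'main_score': 0.60}]}}
result_b = {'scores': {'validation': [{'main_score': 0.60}],
                       'test': [{'main_score': 0.80}]}}

print(first_split_score(result_a))  # 0.8
print(first_split_score(result_b))  # 0.6
```

The same scores yield two different summaries, and in both cases one split is silently dropped.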

Reproduction

I ran the current summary.py on the Qwen3-Embedding-8B MTEB multilingual result directory:

python evaluation/summary.py \
  results/Qwen__Qwen3-Embedding-8B/4e423935c619ae4df87b646a3ce949610c66241c

Output summary:

missed tasks []
final score 131 70.74854961832064
Retrieval 18 0.7186003327533731
Classification 43 0.7397151816228289
Clustering 16 0.5768753358581853
Reranking 6 0.6631180416666668
PairClassification 11 0.8631857846320348
BitextMining 13 0.8089235039752247
MultilabelClassification 5 0.28655478260869566
InstructionRetrieval 3 0.10064133333333332
STS 16 0.8114725920138889
Mean(Type) 0.6187874320515812

The official Qwen model card / report table for Qwen3-Embedding-8B on MTEB (Multilingual) reports different values:

| Metric | Current summary.py output | Official report / MTEB leaderboard |
|---|---|---|
| Mean (Task) | 70.75 | 70.58 |
| Mean (Type) | 61.88 | 61.69 |
| Bitext Mining | 80.89 | 80.89 |
| Classification | 73.97 | 74.00 |
| Clustering | 57.69 | 57.65 |
| Instruction Retrieval | 10.06 | 10.06 |
| Multilabel Classification | 28.66 | 28.66 |
| Pair Classification | 86.32 | 86.40 |
| Reranking | 66.31 | 65.63 |
| Retrieval | 71.86 | 70.88 |
| STS | 81.15 | 81.08 |
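As a quick sanity check on the comparison above, the sketch below recomputes Mean (Type) as the unweighted mean of the nine per-type scores and lists the per-type gaps. All numbers are copied from the comparison; the variable names are only for illustration.

```python
from statistics import mean

# (current summary.py output, official report) per task type, from the comparison above.
rows = {
    'Bitext Mining': (80.89, 80.89),
    'Classification': (73.97, 74.00),
    'Clustering': (57.69, 57.65),
    'Instruction Retrieval': (10.06, 10.06),
    'Multilabel Classification': (28.66, 28.66),
    'Pair Classification': (86.32, 86.40),
    'Reranking': (66.31, 65.63),
    'Retrieval': (71.86, 70.88),
    'STS': (81.15, 81.08),
}

local_mean_type = round(mean(local for local, _ in rows.values()), 2)
official_mean_type = round(mean(official for _, official in rows.values()), 2)
print(local_mean_type, official_mean_type)  # 61.88 61.69

# Per-type gap (local minus official), largest absolute gap first.
gaps = sorted(((round(l - o, 2), name) for name, (l, o) in rows.items()),
              key=lambda item: abs(item[0]), reverse=True)
for gap, name in gaps:
    print(f'{name}: {gap:+.2f}')
```

Both Mean (Type) values are reproduced from the per-type rows, and the largest gaps are in Retrieval and Reranking.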

Expected behavior

summary.py should aggregate result JSON files in the same way as official MTEB, instead of selecting only list(result['scores'].keys())[0].

A possible fix would be to iterate over all relevant splits in result['scores'] and apply the same split/task/subset aggregation policy used by MTEB.
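A minimal sketch of that idea, under one loud assumption: that the intended policy is a flat mean of main_score over every (split, subset) entry. I have not confirmed that this matches MTEB's aggregation in every case, so it should be checked against the mteb version used for the leaderboard. The function name score_all_splits and the optional splits argument are hypothetical.

```python
def score_all_splits(result, splits=None):
    """Average main_score over all entries of all selected splits.

    ASSUMPTION: a flat mean over every (split, subset) entry; verify
    against the aggregation in the mteb version you compare with.
    """
    scores = result['scores']
    # Default to every split in the file; otherwise keep only requested ones.
    selected = list(scores) if splits is None else [s for s in splits if s in scores]
    values = [ele['main_score'] for split in selected for ele in scores[split]]
    if not values:
        raise ValueError('no scores found for the selected splits')
    return sum(values) / len(values)


# Toy data (made up): the result no longer depends on key order.
result_a = {'scores': {'test': [{'main_score': 0.5}, {'main_score': 0.25}],
                       'validation': [{'main_score': 0.75}]}}
result_b = {'scores': {'validation': [{'main_score': 0.75}],
                       'test': [{'main_score': 0.5}, {'main_score': 0.25}]}}

print(score_all_splits(result_a))  # 0.5
print(score_all_splits(result_b))  # 0.5
print(score_all_splits(result_a, splits=['test']))  # 0.375
```

Passing an explicit splits list would let summary.py restrict the average to each task's official evaluation splits, if that turns out to be the correct policy.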

Could you please check whether summary.py should be updated to match the official MTEB aggregation? Thanks!
