Hi Qwen team, thanks for releasing the Qwen3 Embedding evaluation scripts.
I noticed that evaluation/summary.py appears to aggregate only the first split found in each MTEB result JSON file, while the official MTEB leaderboard/report aggregation uses the complete result structure.
In the current script:
eval_split = list(result['scores'].keys())[0]
score = sum([ele['main_score'] for ele in result['scores'][eval_split]]) / len(result['scores'][eval_split])
This makes the reported score depend on JSON key order and ignores any remaining splits under result['scores']. For tasks whose result files contain more than one split, the local summary can therefore diverge from the official MTEB aggregation.
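To make the key-order dependence concrete, here is a small sketch that applies the same first-split logic to a made-up result. The split names and scores are hypothetical; only the scores -> split -> list of entries with main_score layout is taken from the snippet above.

```python
def first_split_score(result):
    """Replicates the first-split logic quoted above from summary.py."""
    eval_split = list(result["scores"].keys())[0]
    entries = result["scores"][eval_split]
    return sum(e["main_score"] for e in entries) / len(entries)


# Hypothetical per-split entries (values are made up for illustration).
validation = [{"hf_subset": "default", "main_score": 0.60}]
test = [{"hf_subset": "default", "main_score": 0.80}]

# Same data, only the key order differs.
result_a = {"scores": {"validation": validation, "test": test}}
result_b = {"scores": {"test": test, "validation": validation}}

score_a = first_split_score(result_a)
score_b = first_split_score(result_b)
print(score_a, score_b)  # 0.6 0.8
```

Both dicts hold identical scores, yet the reported value changes with whichever split happens to be serialized first.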
Reproduction
I ran the current summary.py on the Qwen3-Embedding-8B MTEB multilingual result directory:
python evaluation/summary.py \
results/Qwen__Qwen3-Embedding-8B/4e423935c619ae4df87b646a3ce949610c66241c
Output summary:
missed tasks []
final score 131 70.74854961832064
Retrieval 18 0.7186003327533731
Classification 43 0.7397151816228289
Clustering 16 0.5768753358581853
Reranking 6 0.6631180416666668
PairClassification 11 0.8631857846320348
BitextMining 13 0.8089235039752247
MultilabelClassification 5 0.28655478260869566
InstructionRetrieval 3 0.10064133333333332
STS 16 0.8114725920138889
Mean(Type) 0.6187874320515812
The official Qwen model card / report table for Qwen3-Embedding-8B on MTEB (Multilingual) reports different values:
| Metric | Current summary.py output | Official report / MTEB leaderboard |
| --- | --- | --- |
| Mean (Task) | 70.75 | 70.58 |
| Mean (Type) | 61.88 | 61.69 |
| Bitext Mining | 80.89 | 80.89 |
| Classification | 73.97 | 74.00 |
| Clustering | 57.69 | 57.65 |
| Instruction Retrieval | 10.06 | 10.06 |
| Multilabel Classification | 28.66 | 28.66 |
| Pair Classification | 86.32 | 86.40 |
| Reranking | 66.31 | 65.63 |
| Retrieval | 71.86 | 70.88 |
| STS | 81.15 | 81.08 |
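As a sanity check on the table, the sketch below recomputes Mean (Type) from the nine category scores and lists the per-category gaps. The numbers are copied from the table; treating Mean (Type) as a plain unweighted average over categories is an assumption, although it reproduces both reported values.

```python
# Category scores copied from the table: (summary.py output, official report).
scores = {
    "Bitext Mining": (80.89, 80.89),
    "Classification": (73.97, 74.00),
    "Clustering": (57.69, 57.65),
    "Instruction Retrieval": (10.06, 10.06),
    "Multilabel Classification": (28.66, 28.66),
    "Pair Classification": (86.32, 86.40),
    "Reranking": (66.31, 65.63),
    "Retrieval": (71.86, 70.88),
    "STS": (81.15, 81.08),
}

# Assumption: Mean (Type) is the unweighted mean over the nine categories.
local_mean = sum(local for local, _ in scores.values()) / len(scores)
official_mean = sum(official for _, official in scores.values()) / len(scores)
print(f"Mean (Type): local {local_mean:.2f}, official {official_mean:.2f}")

# Gap per category (local minus official), largest absolute gap first.
gaps = sorted(
    ((name, round(local - official, 2)) for name, (local, official) in scores.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, gap in gaps:
    print(f"{name:<26} {gap:+.2f}")
```

The two largest gaps are Retrieval (+0.98) and Reranking (+0.68), while Bitext Mining, Instruction Retrieval, and Multilabel Classification match exactly.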
References
BenchmarkResults.to_dataframe(...): https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/_create_table.py
Expected behavior
summary.py should aggregate result JSON files in the same way as official MTEB, instead of selecting only list(result['scores'].keys())[0].
A possible fix would be to iterate over all relevant splits in result['scores'] and apply the same split/task/subset aggregation policy used by MTEB.
Could you please check whether summary.py should be updated to match the official MTEB aggregation? Thanks!
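For discussion, a minimal sketch of the possible fix described above. It pools the main_score of every entry across all selected splits instead of reading only the first key. The function name task_score and the optional eval_splits argument are hypothetical, and the equal weighting of all (split, subset) entries is an assumption that should be checked against MTEB's own aggregation before adoption.

```python
def task_score(result, eval_splits=None):
    """Average main_score over every entry of the selected splits.

    Assumption: all (split, subset) entries are weighted equally. Verify
    this against MTEB's aggregation before relying on it.
    """
    scores = result["scores"]
    # Default to all splits in the file; sorting removes any key-order effect.
    splits = sorted(scores) if eval_splits is None else eval_splits
    values = [
        entry["main_score"]
        for split in splits
        for entry in scores[split]
    ]
    if not values:
        raise ValueError("no scores found for the requested splits")
    return sum(values) / len(values)


# Hypothetical result file content with two splits (made-up values).
result = {
    "scores": {
        "validation": [{"main_score": 0.60}],
        "test": [{"main_score": 0.80}, {"main_score": 0.70}],
    }
}
print(task_score(result))                        # pooled over both splits
print(task_score(result, eval_splits=["test"]))  # restricted to one split
```

Passing eval_splits explicitly would let summary.py restrict the average to the splits a benchmark actually evaluates, if that turns out to be what the official aggregation does.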