evaluation/summary.py only aggregates the first split in each MTEB result file #187

Hi Qwen team, thanks for releasing the Qwen3 Embedding evaluation scripts.

I noticed that `evaluation/summary.py` appears to aggregate only the first split found in each MTEB result JSON file, while the official MTEB leaderboard/report aggregation uses the complete result structure.

In the current script:

```python
eval_split = list(result['scores'].keys())[0]
score = sum([ele['main_score'] for ele in result['scores'][eval_split]]) / len(result['scores'][eval_split])
```

This makes the reported score depend on JSON key order and ignores any remaining splits under `result['scores']`. For tasks whose result files contain more than one split, the local summary can therefore diverge from the official MTEB aggregation.
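
To make the key-order dependence concrete, here is a minimal sketch on a toy result dict. The split names, subsets, and scores are made up for illustration and do not come from a real result file; only the `scores -> split -> list of entries with main_score` layout is taken from the snippet above.

```python
# Toy result dict (hypothetical values, not from a real MTEB result file).
result = {
    "scores": {
        "validation": [{"hf_subset": "default", "main_score": 0.60}],
        "test": [
            {"hf_subset": "en", "main_score": 0.80},
            {"hf_subset": "de", "main_score": 0.70},
        ],
    }
}


def first_split_score(result):
    """Mirror the current summary.py logic: only the first split is used."""
    eval_split = list(result["scores"].keys())[0]
    entries = result["scores"][eval_split]
    return sum(e["main_score"] for e in entries) / len(entries)


def with_reversed_splits(result):
    """Return a copy of the result whose splits appear in reverse key order."""
    return {"scores": dict(reversed(list(result["scores"].items())))}


score_a = first_split_score(result)                        # reads "validation"
score_b = first_split_score(with_reversed_splits(result))  # reads "test"
print(score_a, score_b)
```

The two calls see the same data but return different scores, because only whichever split happens to come first is read.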

**Reproduction**

I ran the current `summary.py` on the `Qwen3-Embedding-8B` MTEB multilingual result directory:

```bash
python evaluation/summary.py \
  results/Qwen__Qwen3-Embedding-8B/4e423935c619ae4df87b646a3ce949610c66241c
```

Output summary:

```text
missed tasks []
final score 131 70.74854961832064
Retrieval 18 0.7186003327533731
Classification 43 0.7397151816228289
Clustering 16 0.5768753358581853
Reranking 6 0.6631180416666668
PairClassification 11 0.8631857846320348
BitextMining 13 0.8089235039752247
MultilabelClassification 5 0.28655478260869566
InstructionRetrieval 3 0.10064133333333332
STS 16 0.8114725920138889
Mean(Type) 0.6187874320515812
```

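The official Qwen model card / report table for `Qwen3-Embedding-8B` on `MTEB (Multilingual)` reports different values. In the table below, the `summary.py` per-type scores from the output above are scaled to percentages and rounded to two decimals: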

| Metric | Current `summary.py` output | Official report / MTEB leaderboard |
| --- | ---: | ---: |
| Mean (Task) | 70.75 | 70.58 |
| Mean (Type) | 61.88 | 61.69 |
| Bitext Mining | 80.89 | 80.89 |
| Classification | 73.97 | 74.00 |
| Clustering | 57.69 | 57.65 |
| Instruction Retrieval | 10.06 | 10.06 |
| Multilabel Classification | 28.66 | 28.66 |
| Pair Classification | 86.32 | 86.40 |
| Reranking | 66.31 | 65.63 |
| Retrieval | 71.86 | 70.88 |
| STS | 81.15 | 81.08 |
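
For reference, the left-hand column can be re-derived from the raw `summary.py` output above. The sketch below copies the per-type values from that output and the official values from the table, then computes `Mean (Type)` and the per-type gaps; the variable names are mine and nothing here is part of `summary.py`.

```python
# Per-type scores copied from the summary.py output above (0-1 scale).
local = {
    "Retrieval": 0.7186003327533731,
    "Classification": 0.7397151816228289,
    "Clustering": 0.5768753358581853,
    "Reranking": 0.6631180416666668,
    "PairClassification": 0.8631857846320348,
    "BitextMining": 0.8089235039752247,
    "MultilabelClassification": 0.28655478260869566,
    "InstructionRetrieval": 0.10064133333333332,
    "STS": 0.8114725920138889,
}

# Official values copied from the table above (percent).
official = {
    "Retrieval": 70.88,
    "Classification": 74.00,
    "Clustering": 57.65,
    "Reranking": 65.63,
    "PairClassification": 86.40,
    "BitextMining": 80.89,
    "MultilabelClassification": 28.66,
    "InstructionRetrieval": 10.06,
    "STS": 81.08,
}

# Scale to percent and round to two decimals, as in the table.
local_pct = {name: round(100 * value, 2) for name, value in local.items()}

# Mean (Type) is the unweighted mean of the nine per-type scores.
mean_type_pct = round(100 * sum(local.values()) / len(local), 2)

# Signed gap (local minus official) per task type, largest first.
gaps = {name: round(local_pct[name] - official[name], 2) for name in local}
largest = sorted(gaps, key=lambda name: abs(gaps[name]), reverse=True)[:2]

print(mean_type_pct)
print(largest, [gaps[name] for name in largest])
```

The two largest gaps are in Retrieval and Reranking, so those task types are probably the first place to look for result files with more than one split.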

**References**

- Current script: https://github.com/QwenLM/Qwen3-Embedding/blob/main/evaluation/summary.py
- Qwen3-Embedding-8B model card table: https://huggingface.co/Qwen/Qwen3-Embedding-8B#evaluation
- MTEB table aggregation uses `BenchmarkResults.to_dataframe(...)`: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/_create_table.py

**Expected behavior**

`summary.py` should aggregate result JSON files in the same way as official MTEB, instead of selecting only `list(result['scores'].keys())[0]`.

A possible fix would be to iterate over all relevant splits in `result['scores']` and apply the same split/task/subset aggregation policy used by MTEB.
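
As a rough sketch of what that could look like: the helper below averages `main_score` over every entry of every selected split. The function name `task_score` and its `splits` parameter are hypothetical, and the flat mean over all (split, subset) entries is my assumption; the exact policy (for example, which splits each benchmark task evaluates) should be checked against the MTEB code linked above.

```python
from statistics import mean


def task_score(result, splits=None):
    """Average main_score over all subset entries of the selected splits.

    Assumption: a flat mean over every (split, subset) entry. Whether MTEB
    restricts or weights splits differently must be verified separately.
    """
    scores = result["scores"]
    selected = list(scores) if splits is None else list(splits)

    # Fail loudly instead of silently skipping a requested split.
    missing = [split for split in selected if split not in scores]
    if missing:
        raise KeyError(f"splits not found in result file: {missing}")

    values = [entry["main_score"] for split in selected for entry in scores[split]]
    if not values:
        raise ValueError("no main_score entries found for the selected splits")
    return mean(values)


# Toy input with made-up numbers, only to show the call shape.
toy = {
    "scores": {
        "validation": [{"main_score": 0.60}],
        "test": [{"main_score": 0.80}, {"main_score": 0.70}],
    }
}
all_splits = task_score(toy)           # uses both splits
test_only = task_score(toy, ["test"])  # restricts to one split
print(all_splits, test_only)
```

Unlike `list(result['scores'].keys())[0]`, this does not depend on key order, and passing `splits` explicitly would let the script follow whatever per-task split selection the benchmark defines.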

Could you please check whether `summary.py` should be updated to match the official MTEB aggregation? Thanks!

