Korean Vision Document Retrieval (KoViDoRe) benchmark for evaluating text-to-image retrieval models on Korean visual documents.
KoViDoRe is a comprehensive benchmark for evaluating Korean visual document retrieval capabilities. Built upon the foundation of ViDoRe, it assesses how well models can retrieve relevant Korean visual documents—including screenshots, presentation slides, and office documents—when given Korean text queries.
KoViDoRe v1 comprises five tasks (MIR, VQA, Slide, Office, and FinOCR), each targeting a type of visual document common in Korean business and academic settings. This spread lets the benchmark measure multimodal retrieval performance across different document formats and content types.
KoViDoRe v2 addresses a key limitation of v1, single-page matching, by generating queries whose answers require aggregating information across multiple pages. It consists of four tasks targeting practical enterprise domains: cybersecurity, economic reports, energy documents, and HR materials.
| Subset | Description | Documents | Queries | Example Query | Link |
|---|---|---|---|---|---|
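| HR | Workforce outlook and employment policy | 2,109 | 221 | 산업용 첨단화학소재 분야의 대졸 채용률, 구매·영업·시장조사 직무의 채용률, 생산기술 직무의 채용-퇴직 격차를 비교하여 인력 수급 불균형 원인을 분석하라. (Compare the college-graduate hiring rate in the industrial advanced chemical materials field, the hiring rate for purchasing, sales, and market research roles, and the hiring-retirement gap for production technology roles, and analyze the causes of the workforce supply-demand imbalance.) | 🤗 Dataset |
| Energy | Energy policy and power market trends | 1,993 | 173 | 액화석유가스 안전공급 계약제에서 체적판매방법과 중량판매방법의 최소 계약기간은 어떻게 다르며, 소비자보장책임보험의 최대 보상한도는 얼마인가요? (Under the LPG safe supply contract system, how do the minimum contract periods differ between volume-based and weight-based sales, and what is the maximum coverage limit of the consumer guarantee liability insurance?) | 🤗 Dataset |
| Economic | Quarterly economic trend reports | 1,477 | 163 | 2022년 원유 도입 단가 상승과 원/달러 환율 변동이 국내 회사채 수익률에 미친 영향을 비교 분석하라 (Compare and analyze how the 2022 rise in crude oil import prices and won/dollar exchange rate fluctuations affected domestic corporate bond yields.) | 🤗 Dataset |
| Cybersecurity | Cyber threat analysis and security guides | 1,150 | 149 | 네트워크 백업의 보안 취약점을 해결하기 위해 WORM 스토리지 기술이 어떻게 적용되는가? (How is WORM storage technology applied to address the security vulnerabilities of network backups?) | 🤗 Dataset |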
The following table shows performance on all KoViDoRe v1 tasks (NDCG@5 scores as percentages, sorted by Average; model size in millions of parameters):
| Model | Model Size (M params) | FinOCR | MIR | Office | Slide | VQA | Average |
|---|---|---|---|---|---|---|---|
| jinaai/jina-embeddings-v4 | 3800 | 94.1 | 73.6 | 88.7 | 89.7 | 86.3 | 86.5 |
| TomoroAI/tomoro-colqwen3-embed-8b | 8000 | 81.8 | 60.9 | 84.2 | 86.3 | 82.9 | 79.2 |
| nomic-ai/colnomic-embed-multimodal-7b | 7000 | 78.0 | 63.4 | 82.0 | 86.8 | 85.2 | 79.1 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 7000 | 65.9 | 60.2 | 79.5 | 84.2 | 82.1 | 74.4 |
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 75.5 | 56.3 | 82.2 | 36.3 | 72.9 | 64.6 |
| vidore/colqwen2-v1.0 | 2210 | 61.6 | 44.0 | 56.7 | 66.0 | 67.5 | 59.2 |
| TomoroAI/tomoro-colqwen3-embed-4b | 4000 | 67.1 | 32.4 | 42.5 | 66.9 | 52.8 | 52.3 |
| vidore/colqwen2.5-v0.2 | 3000 | 45.0 | 48.0 | 62.2 | 25.6 | 68.0 | 49.8 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 3000 | 43.0 | 37.5 | 53.7 | 24.8 | 62.3 | 44.3 |
| eagerworks/eager-embed-v1 | 4000 | 17.9 | 20.8 | 37.0 | 60.6 | 49.0 | 37.1 |
| vidore/colpali-v1.3 | 2920 | 38.2 | 14.9 | 23.6 | 50.7 | 30.0 | 31.5 |
| vidore/colpali-v1.2 | 2920 | 37.1 | 13.2 | 24.8 | 46.7 | 28.4 | 30.0 |
| vidore/colpali-v1.1 | 2920 | 35.4 | 16.4 | 19.0 | 44.1 | 25.6 | 28.1 |
| vidore/colSmol-500M | 500 | 43.6 | 3.7 | 7.4 | 13.5 | 6.2 | 14.9 |
| jinaai/jina-clip-v2 | 865 | 1.1 | 8.4 | 14.4 | 33.3 | 11.6 | 13.8 |
| vidore/colSmol-256M | 256 | 37.4 | 3.2 | 4.8 | 10.8 | 5.6 | 12.4 |
| google/siglip-so400m-patch14-384 | 878 | 4.0 | 3.9 | 6.3 | 21.3 | 7.2 | 8.5 |
| TIGER-Lab/VLM2Vec-Full | 4150 | 1.7 | 1.6 | 8.0 | 15.0 | 6.7 | 6.6 |
| laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 2540 | 0.5 | 1.9 | 3.3 | 12.5 | 5.6 | 4.8 |
| openai/clip-vit-base-patch16 | 151 | 0.3 | 0.6 | 0.0 | 5.9 | 3.3 | 2.0 |
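The Average column appears to be the unweighted mean of the per-task scores, which can be checked against any row of the table above. A minimal sketch, where `macro_average` is a hypothetical helper and not part of the benchmark code:

```python
def macro_average(task_scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-task scores, rounded to one decimal."""
    return round(sum(task_scores.values()) / len(task_scores), 1)


# Per-task NDCG@5 values copied from the jinaai/jina-embeddings-v4 row.
jina_v4 = {"FinOCR": 94.1, "MIR": 73.6, "Office": 88.7, "Slide": 89.7, "VQA": 86.3}
print(macro_average(jina_v4))  # 86.5
```

Because the published task scores are already rounded, a recomputed average may differ from the table by 0.1 for some rows.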
The following table shows performance on all KoViDoRe v2 tasks (NDCG@10 scores as percentages, sorted by Average; model size in millions of parameters):
| Model | Model Size (M params) | Cybersecurity | Economic | Energy | HR | Average |
|---|---|---|---|---|---|---|
| jinaai/jina-embeddings-v4 | 3800 | 77.6 | 24.5 | 67.7 | 50.1 | 55.0 |
| TomoroAI/tomoro-colqwen3-embed-8b | 8000 | 73.7 | 16.3 | 58.5 | 26.5 | 43.8 |
| nomic-ai/colnomic-embed-multimodal-7b | 7000 | 69.6 | 12.4 | 59.5 | 33.3 | 43.7 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 7000 | 66.0 | 12.1 | 55.4 | 26.4 | 40.0 |
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 47.4 | 10.5 | 44.2 | 32.9 | 33.8 |
| vidore/colqwen2-v1.0 | 2210 | 53.3 | 8.0 | 42.0 | 14.7 | 29.5 |
| TomoroAI/tomoro-colqwen3-embed-4b | 4000 | 55.3 | 9.1 | 31.0 | 10.1 | 26.4 |
| vidore/colqwen2.5-v0.2 | 3000 | 43.9 | 3.9 | 44.3 | 13.5 | 26.4 |
| eagerworks/eager-embed-v1 | 4000 | 51.5 | 5.4 | 32.7 | 7.0 | 24.2 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 3000 | 41.4 | 6.3 | 31.5 | 11.3 | 22.6 |
| vidore/colpali-v1.3 | 2920 | 34.7 | 1.6 | 20.6 | 6.2 | 15.8 |
| vidore/colpali-v1.1 | 2920 | 31.9 | 3.0 | 18.2 | 6.0 | 14.8 |
| vidore/colpali-v1.2 | 2920 | 33.2 | 2.1 | 16.4 | 4.5 | 14.1 |
| vidore/colSmol-500M | 500 | 26.2 | 0.6 | 9.9 | 0.9 | 9.4 |
| jinaai/jina-clip-v2 | 865 | 20.4 | 0.2 | 11.3 | 3.1 | 8.8 |
| vidore/colSmol-256M | 256 | 19.7 | 1.0 | 9.5 | 1.1 | 7.8 |
| google/siglip-so400m-patch14-384 | 878 | 15.3 | 1.3 | 5.3 | 1.1 | 5.8 |
| laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 2540 | 13.8 | 0.3 | 4.2 | 0.4 | 4.7 |
| TIGER-Lab/VLM2Vec-Full | 4150 | 9.8 | 1.3 | 3.2 | 1.3 | 3.9 |
| openai/clip-vit-base-patch16 | 151 | 4.1 | 0.0 | 0.8 | 0.6 | 1.4 |
We provide interpretability maps to show how different models attend to document image patches when processing a query. For each query below, each row shows the maps for a different query word.
- Query: 인천 광역시의 CT 설치 비율은 몇 프로니? (What percentage is the CT installation rate in Incheon Metropolitan City?)
[Figure: interpretability maps for this query from vidore/colpali-v1.3, vidore/colqwen2.5-v0.2, and jinaai/jina-embeddings-v4, one row per query word]
- Query: 지방자치단체가 보건복지부에 제출하는 문서는 무엇인가요? (What documents do local governments submit to the Ministry of Health and Welfare?)
[Figure: interpretability maps for this query from vidore/colpali-v1.3, vidore/colqwen2.5-v0.2, and jinaai/jina-embeddings-v4, one row per query word]
- Query: 나무가 주거 공간에서 제공하는 역할은 무엇인가? (What role do trees play in residential spaces?)
[Figure: interpretability maps for this query from vidore/colpali-v1.3, vidore/colqwen2.5-v0.2, and jinaai/jina-embeddings-v4, one row per query word]
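How these maps are produced is not described here. For late-interaction retrievers such as the ColPali family, a common approach is to score every image patch embedding against the embedding of one query token and arrange the scores on the patch grid. A minimal sketch under that assumption, using toy vectors and a hypothetical `similarity_map` helper:

```python
def similarity_map(query_token, patch_embeddings, grid_width):
    """Dot-product score of one query token against each patch, as grid rows."""
    scores = [sum(q * p for q, p in zip(query_token, patch)) for patch in patch_embeddings]
    if len(scores) % grid_width != 0:
        raise ValueError("number of patches must be a multiple of grid_width")
    return [scores[i:i + grid_width] for i in range(0, len(scores), grid_width)]


# Toy example: four 2-dimensional patch embeddings laid out on a 2x2 grid.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
token = [1.0, 0.0]
heatmap = similarity_map(token, patches, grid_width=2)
print(heatmap)  # [[1.0, 0.0], [0.5, 0.0]]
```

In practice the grid would be upsampled and overlaid on the page image, with one such map per query token.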
Command-line usage:

    # Install dependencies
    uv sync

    # Run with custom model
    uv run kovidore --model "your-model-name"

    # Run specific tasks
    uv run kovidore --model "your-model-name" --tasks mir vqa

    # Run with custom batch size (default: 16)
    uv run kovidore --model "your-model-name" --batch-size 32

    # List available tasks
    uv run kovidore --list-tasks

Python usage:

    from src.evaluate import run_benchmark

    # Run all tasks
    evaluation = run_benchmark("your-model-name")

    # Run specific tasks
    evaluation = run_benchmark("your-model-name", tasks=["mir", "vqa"])

    # Run with custom batch size
    evaluation = run_benchmark("your-model-name", batch_size=32)

Note: Unlike KoViDoRe v1, KoViDoRe v2 is freely available on Hugging Face. You can access the full dataset collection here.
For KoViDoRe v1, we provide pre-processed queries and query-corpus mappings for each task. Due to licensing restrictions, however, you need to download the image datasets manually from AI Hub (see the Acknowledgements section for dataset links).
Setup Instructions:
- Download the required datasets from AI Hub
- Extract and place images in the following directory structure:

      data/
      ├── mir/images/
      ├── vqa/images/
      ├── slide/images/
      ├── office/images/
      └── finocr/images/

The benchmark will automatically locate and use the images from these directories during evaluation.
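Before launching an evaluation, it can help to confirm that the image directories are in place. The sketch below checks the layout described above; `missing_image_dirs` is a hypothetical helper written for this illustration and is not part of the benchmark package:

```python
from pathlib import Path

# Task directory names taken from the layout described above.
V1_TASKS = ("mir", "vqa", "slide", "office", "finocr")


def missing_image_dirs(data_root):
    """Return the expected image directories that do not exist under data_root."""
    root = Path(data_root)
    expected = [root / task / "images" for task in V1_TASKS]
    return [str(path) for path in expected if not path.is_dir()]


missing = missing_image_dirs("data")
print("All image directories found." if not missing else f"Missing: {missing}")
```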
Results are saved automatically to the results/ directory when evaluation completes. KoViDoRe v1 uses NDCG@5 and KoViDoRe v2 uses NDCG@10 as the main evaluation metric for all tasks.
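For reference, NDCG@k divides the discounted cumulative gain of the top k retrieved pages by that of an ideal ranking. The sketch below uses the linear-gain form with a log2 discount. It illustrates the metric only and is not the benchmark's own scoring code; `ndcg_at_k` and the page IDs are hypothetical:

```python
import math


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with linear gain: DCG of the top-k ranking over the ideal DCG."""
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# A multi-page query: two pages are relevant, retrieved at ranks 1 and 3.
relevance = {"page_12": 1, "page_13": 1}
ranking = ["page_12", "page_40", "page_13", "page_07"]
print(round(ndcg_at_k(ranking, relevance, k=10), 3))  # 0.92
```

With several relevant pages per query, as in KoViDoRe v2, a model only reaches a score of 1.0 if all of them are ranked ahead of every irrelevant page.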
This benchmark is inspired by the ViDoRe benchmark. We thank the original authors for their foundational work that helped shape our approach to Korean visual document retrieval.
We also acknowledge the following Korean datasets from AI Hub that were used to construct each task in KoViDoRe v1:
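- 멀티모달 정보검색 데이터 (Multimodal Information Retrieval Data) - Used for the KoVidoreMIRRetrieval task
- 시각화 자료 질의응답 데이터 (Visualization Material Question Answering Data) - Used for the KoVidoreVQARetrieval task
- 오피스 문서 생성 데이터 (Office Document Generation Data) - Used for the KoVidoreSlideRetrieval and KoVidoreOfficeRetrieval tasks
- OCR 데이터(금융 및 물류) (OCR Data, Finance and Logistics) - Used for the KoVidoreFinOCRRetrieval task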
For questions or suggestions, please open an issue on the GitHub repository or contact the maintainers:
If you use KoViDoRe in your research, please cite as follows:
@misc{KoViDoRe2025,
author = {Yongbin Choi and Yongwoo Song},
title = {KoViDoRe: Korean Vision Document Retrieval Benchmark},
year = {2025},
url = {https://github.com/whybe-choi/kovidore-benchmark},
note = {A comprehensive benchmark for evaluating visual document retrieval models on Korean document images}
}