# JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and Job Descriptions
JobResQA is a multilingual Question Answering benchmark for evaluating LLM capabilities on HR-specific tasks. The dataset contains 581 QA pairs across 105 synthetic résumé-job description pairs in 5 languages (en, es, it, de, zh), with three complexity levels from basic extraction to cross-document reasoning.
Key Features:
- Multilingual: Parallel data in 5 languages (`data/`)
- Privacy-Preserving: Synthetic data with anonymization (`resources/placeholders/`)
- Three Complexity Levels: Basic (26.5%), Intermediate (36.7%), Complex (36.8%)
- Fairness-Aware: Controlled demographic attributes for bias analysis
The benchmark consists of 5 language-specific TSV files in `data/`:

- `jobresqa.en.tsv` - English
- `jobresqa.de.tsv` - German
- `jobresqa.es.tsv` - Spanish
- `jobresqa.it.tsv` - Italian
- `jobresqa.zh.tsv` - Chinese
Each TSV file contains the columns: `example_id`, `resume_id`, `resume`, `jd_id`, `jd`, `question`, `short_answer`, `explanation`, `notes`, `complexity_level`, `language`.
Anonymization: All personal information uses placeholders such as `[NAME]`, `[EMAIL]`, `[PHONE]`, and `[COMPANY]`. See `resources/placeholders/` for the complete list.
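To check which placeholders appear in a given document, a regex over the bracketed tokens is enough. This is a minimal sketch; the tokens in the sample string are a subset of the full lists in `resources/placeholders/`:

```python
import re
from collections import Counter

def count_placeholders(text: str) -> Counter:
    """Count occurrences of [UPPERCASE] anonymization placeholders."""
    # Placeholders follow the [NAME]/[EMAIL]/[PHONE] pattern used in the data.
    return Counter(re.findall(r"\[[A-Z_]+\]", text))

resume = "[NAME] worked at [COMPANY]. Contact: [EMAIL], [PHONE]. [NAME] led a team."
print(count_placeholders(resume))  # [NAME] appears twice, the rest once
```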
To load a split:

```python
import pandas as pd

df = pd.read_csv('data/jobresqa.en.tsv', sep='\t')
```

Repository layout:

- `data/` - Benchmark dataset (5 language TSV files)
- `resources/` - Prompts and resources
  - `prompts/` - LLM prompts for QA, generation, and translation
  - `placeholders/` - Anonymization placeholders
  - `mqm_annotation/` - Translation quality metrics
- `scripts/` - Example scripts for QA, evaluation, generation, and translation
- `src/` - Source code
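Once a split is loaded, per-complexity or per-language counts come from a simple `groupby`. A sketch over an inline frame with the benchmark's column names (the real files live in `data/`):

```python
import pandas as pd

# Tiny stand-in for one TSV split, using the benchmark's column names.
df = pd.DataFrame({
    "example_id": [1, 2, 3],
    "question": ["Q1", "Q2", "Q3"],
    "complexity_level": ["basic", "complex", "complex"],
    "language": ["en", "en", "en"],
})

# Distribution of questions per complexity level.
counts = df.groupby("complexity_level").size()
print(counts.to_dict())  # {'basic': 1, 'complex': 2}
```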
```shell
git clone https://github.com/yourusername/jobresqa-benchmark.git
cd jobresqa-benchmark
bash install.sh
cp .env.example .env  # Add your API keys
```

Required environment variables:
- `OPENAI_API_KEY` - OpenAI API key
- `REPO_DIR` - Path to this repository
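A script can fail fast when these are unset. A minimal sketch using only the standard library (the `require_env` helper is illustrative, not part of this repository):

```python
import os

def require_env(*names: str) -> dict:
    """Return the named environment variables, raising if any is unset."""
    missing = [n for n in names if n not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# Example (uncomment after filling in .env):
# config = require_env("OPENAI_API_KEY", "REPO_DIR")
```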
The `scripts/` directory contains example scripts:

- `run_qa.py` - Question answering
- `run_eval_qa.py` - Evaluate answers using G-Eval
- `run_resume_synthetic_generation.py` - Generate synthetic résumés
- `run_JD_synthetic_generation.py` - Generate job descriptions
- `run_translation.py` - TEaR translation framework
Run the QA script with:

```shell
python scripts/run_qa.py
```

`resources/prompts/` contains LLM prompts:

- `qa/` - Question answering and evaluation
- `resume_jd_generation/` - Synthetic data generation
- `tear_human_in_the_loop/` - Translation framework (TEaR)
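The shipped prompt templates live in `resources/prompts/qa/`; the general shape of assembling a cross-document QA prompt from a résumé/JD pair can be sketched as follows (the wording here is illustrative, not the benchmark's actual template):

```python
def build_qa_prompt(resume: str, jd: str, question: str) -> str:
    """Assemble a cross-document QA prompt from a résumé/JD pair.

    The template text below is a placeholder; the benchmark's real
    prompts are in resources/prompts/qa/.
    """
    return (
        "Answer the question using only the documents below.\n\n"
        f"### Résumé\n{resume}\n\n"
        f"### Job Description\n{jd}\n\n"
        f"### Question\n{question}\nAnswer:"
    )

prompt = build_qa_prompt(
    "[NAME], Python developer at [COMPANY].",
    "Seeking a backend engineer with Python experience.",
    "Does the candidate's experience match the role?",
)
```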
`resources/placeholders/` contains anonymization placeholders:

- `placeholders.{lang}.txt` - Language-specific lists
- `placeholders_translations_dictionary.json` - Cross-language translations
`resources/mqm_annotation/` contains translation quality metrics:

- `mqm_error_categories.txt` - Error taxonomy
- `mqm_human_translations.{lang_pair}.txt` - Human translation examples
- `mqm_human_errors.{lang_pair}.txt` - Annotated errors
This work is available as a preprint on arXiv under the title "JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and Job Descriptions".
If you use this benchmark, please cite the following paper:
```bibtex
@misc{carrino2026jobresqabenchmarkllmmachine,
  title={JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual R\'esum\'es and JDs},
  author={Casimiro Pio Carrino and Paula Estrella and Rabih Zbib and Carlos Escolano and José A. R. Fonollosa},
  year={2026},
  eprint={2601.23183},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.23183},
}
```

Licensed under CC BY-SA 2.0. Copyright © 2025 Avature.
For questions, please open an issue on GitHub.