This repository contains the evaluation logic for MyScholarQA, our personalized Deep Research System. The framework was introduced in the paper:
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
It implements pipelines for generating and evaluating user profiles (used interchangeably with the word 'constitutions') and personalized plans (used interchangeably with the word 'actions'), which add extra information to users' Deep Research queries. We ran this evaluation with synthetic data and LLM-as-a-judge metrics to validate our system pre-deployment (§3).
Further, it houses the logic for using LLM judges to simulate real user data derived from our qualitative interviews.
The codebase supports:
- User profile (constitution) generation — Inferring researcher preferences (knowledge, research style, writing style, positions, audience, etc.) from their published papers.
- Personalized plan generation — Producing query-specific plans (search, organization, generation) conditioned on those profiles (§2.1, §2.2).
- Profile evaluation — Scoring constitutions on specificity, relevance, accuracy, categorization, and citation coverage (§3.2, §3.4).
- Plan evaluation — Assessing personalization alignment, query relevance, and category fit (§3.2, §3.4).
- Simulation evaluation — Predicting user preferences from prompts for qualitative analysis (§4.3).
We also ran evaluations of personalized report generation (§2.3, §3.4), but we are currently working on building a benchmark with real (not synthetic) user data and integrating it into AstaBench for experimental rigor. Stay tuned 👀
The UI and demo for Personalized ScholarQA live in a separate repository and deployment, described next.
If you're interested in the online MyScholarQA system, check these out:
| Resource | Link |
|---|---|
| 🌐 Live demo | personalized-scholarqa.apps.allenai.org |
| 💻 UI repository | github.com/allenai/personalized-scholarqa |
This repository was tested with Python 3.12.8 and pip 25.0. To get started, follow the steps below:
```bash
git clone https://github.com/nbalepur/personalized-scholarqa-eval.git
cd personalized-scholarqa-eval
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Profile and plan generation require preprocessed paper objects. Download and unzip them into the local datasets directory:
1. Download `paper_objects.zip` from Google Drive.
2. Unzip so that paper files sit under `data/local_datasets/paper_objects/`:

   ```bash
   # From repo root
   unzip /path/to/paper_objects.zip -d data/local_datasets/
   ```

   Ensure the layout is `data/local_datasets/paper_objects/<paper_id>.json` (or as expected by `NewPaper.from_json` in the code).
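As a quick sanity check on the layout above, a few lines of Python can confirm where a paper object should live and which IDs are present locally. The helpers below (`paper_path`, `list_paper_ids`) are illustrative only, not part of the repo:

```python
from pathlib import Path

# Directory the repo expects for preprocessed paper objects
PAPER_DIR = Path("data/local_datasets/paper_objects")

def paper_path(paper_id: str) -> Path:
    """Return the expected on-disk location for a single paper object."""
    return PAPER_DIR / f"{paper_id}.json"

def list_paper_ids(paper_dir: Path = PAPER_DIR) -> list[str]:
    """List paper IDs available locally (empty if the zip isn't unpacked yet)."""
    if not paper_dir.is_dir():
        return []
    return sorted(p.stem for p in paper_dir.glob("*.json"))
```

If `list_paper_ids()` comes back empty after unzipping, the archive was likely extracted one level too deep or too shallow.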
Copy the example env and add API keys for the LLM providers you use (OpenAI, Anthropic, Gemini, and/or LiteLLM):
```bash
cp example.env .env
# Edit .env and set OPEN_AI_TOKEN, ANTHROPIC_TOKEN, GEMINI_TOKEN, etc.
```

All datasets are expected under `data/local_datasets/`:

- `paper_objects/` — Paper JSONs (from the zip above) that form the inputs for profile generation
- `simulated_profile_inputs/` — A dataset of real Deep Research queries that forms the evaluation inputs
- `qualitative_coding_data/` — Our validation set for LLM simulation reliability, derived from our qualitative interviews
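A missing API key usually surfaces as a confusing provider error mid-run, so it can help to check the environment up front. The sketch below is illustrative (the token names match `example.env`, but this helper is not part of the repo):

```python
import os

# Token variables named in example.env; only the providers you use need to be set.
PROVIDER_TOKENS = ["OPEN_AI_TOKEN", "ANTHROPIC_TOKEN", "GEMINI_TOKEN"]

def missing_tokens(env: dict[str, str]) -> list[str]:
    """Return the provider tokens that are absent or empty in the given environment."""
    return [name for name in PROVIDER_TOKENS if not env.get(name)]

if __name__ == "__main__":
    absent = missing_tokens(dict(os.environ))
    if absent:
        print("Warning: no key set for:", ", ".join(absent))
```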
We provide several scripts in the scripts/ folder for the main workflows:
From the repo root:
```bash
# Edit paths and model in scripts/generate.sh, then:
./scripts/generate.sh
```

This runs:

- Profile generation — `model.user_profile.generate_profile_batch` (reads paper collections from `data_input_dir`, uses `paper_dir` for full text, writes to `profile_output_dir`).
- Plan generation — `model.plan.generate_plan_batch` (reads profiles and inputs, writes to `plan_output_dir`).

Adjust `paper_dir`, `data_input_dir`, `profile_output_dir`, `plan_output_dir`, `model_name`, and `model_type` in the script as needed.
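The key dataflow in the script above is that plan generation consumes the profiles written by the first stage. A small sketch makes the wiring explicit; `GenerateConfig` and `pipeline_stages` are hypothetical names for illustration (the field names mirror the variables in `scripts/generate.sh`):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GenerateConfig:
    """Mirrors the variables edited in scripts/generate.sh."""
    paper_dir: Path
    data_input_dir: Path
    profile_output_dir: Path
    plan_output_dir: Path
    model_name: str
    model_type: str

def pipeline_stages(cfg: GenerateConfig) -> list[tuple[str, Path]]:
    """Order of the two generation stages and where each writes its outputs.

    Plan generation reads the profiles produced by the first stage, so
    profile_output_dir doubles as an input to the second stage.
    """
    return [
        ("model.user_profile.generate_profile_batch", cfg.profile_output_dir),
        ("model.plan.generate_plan_batch", cfg.plan_output_dir),
    ]
```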
```bash
# Edit paths and model in scripts/eval_profile.sh, then:
./scripts/eval_profile.sh
```

Runs `evaluation.eval_profile` and then `metric_scores.profile_metrics` to produce profile evaluation outputs and summary metrics.
```bash
# Edit paths and model in scripts/eval_plan.sh, then:
./scripts/eval_plan.sh
```

Runs `evaluation.eval_plans` and `metric_scores.plan_metrics` for plan evaluation and metrics.
```bash
# Edit paths in scripts/simulation.sh, then:
./scripts/simulation.sh
```

Runs `evaluation.predict_user_data` and `metric_scores.simulation_plot` for the qualitative/simulation experiments.
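The core question these simulation experiments ask is how often an LLM judge's predicted preference matches the real user's. The actual statistics are computed by `metric_scores.simulation_plot`; the function below is only a generic illustration of that kind of reliability measure, not the repo's implementation:

```python
def agreement_rate(predicted: list[str], actual: list[str]) -> float:
    """Fraction of items where the simulated preference matches the real user's.

    A generic agreement measure for illustration; the repo computes its own
    statistics in metric_scores.simulation_plot.
    """
    if len(predicted) != len(actual):
        raise ValueError("prediction/label lists must be the same length")
    if not predicted:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(predicted)
```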
```
├── data/             # Dataset loaders, paper/profile/plan types
├── evaluation/       # Eval runs, rubrics, metrics, predict_user_data
├── llms/             # Model wrappers (OpenAI, Anthropic, Gemini, LiteLLM)
├── metric_scores/    # Profile/plan metrics and simulation plots
├── model/
│   ├── user_profile/ # Constitution (profile) generation
│   └── plan/         # Personalized plan generation
├── prompt/           # Templates and user-prediction prompts
├── scripts/          # Shell scripts for generate / eval / simulation
├── example.env
└── requirements.txt
```
If you use this code or the Personalized ScholarQA evaluation framework in your work, please cite:
```bibtex
@article{balepur2026language,
  title={Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users},
  author={Balepur, Nishant and Hamada, Malachi and Kishore, Varsha and Feldman, Sergey and Singh, Amanpreet and Siangliulue, Pao and Chang, Joseph Chee and Choi, Eunsol and Boyd-Graber, Jordan Lee and Naik, Aakanksha},
  journal={arXiv preprint arXiv:2603.16120},
  year={2026}
}
```

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
This work was done during my wonderful internship at Ai2's Semantic Scholar team!
If you encounter any issues that are easy to fix, we would appreciate it if you made a pull request. For any other concerns or questions about our paper, please reach out to:
- Nishant Balepur, Intern (nishantbalepur@gmail.com)
- Aakanksha Naik, Mentor (aakankshan@allenai.org)
