This repository contains the evaluation logic for MyScholarQA, our personalized Deep Research System. The framework was introduced in the paper:
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
It implements pipelines for generating and evaluating user profiles (used interchangeably with the word 'constitutions') and personalized plans (used interchangeably with the word 'actions'), which add extra information to users' Deep Research queries. We ran this evaluation with synthetic data and LLM-as-a-judge metrics to validate our system pre-deployment (§3).
Further, it houses the logic for using LLM judges to simulate real user data derived from our qualitative interviews.
The codebase supports:
- User profile (constitution) generation — Inferring researcher preferences (knowledge, research style, writing style, positions, audience, etc.) from their published papers.
- Personalized plan generation — Producing query-specific plans (search, organization, generation) conditioned on those profiles (§2.1, §2.2).
- Profile evaluation — Scoring constitutions on specificity, relevance, accuracy, categorization, and citation coverage (§3.2, §3.4).
- Plan evaluation — Assessing personalization alignment, query relevance, and category fit (§3.2, §3.4).
- Simulation evaluation — Predicting user preferences from prompts for qualitative analysis (§4.3).
We also ran evaluations of personalized report generation (§2.3, §3.4), but we are currently working on building a benchmark with real (not synthetic) user data and integrating it into AstaBench for experimental rigor. Stay tuned 👀
The UI and demo for Personalized ScholarQA live in a separate repository and deployment, described next.
If you're interested in the online MyScholarQA system, check these out:
| Resource | Link |
|---|---|
| 🌐 Live demo | personalized-scholarqa.apps.allenai.org |
| 💻 UI repository | github.com/allenai/personalized-scholarqa |
This repository was tested with Python 3.12.8 and pip 25.0. To get started, follow the steps below:
```bash
git clone https://github.com/nbalepur/personalized-scholarqa-eval.git
cd personalized-scholarqa-eval
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Profile and plan generation require preprocessed paper objects. Download and unzip them into the local datasets directory:
1. Download `paper_objects.zip` from Google Drive.
2. Unzip so that paper files sit under `data/local_datasets/paper_objects/`:

   ```bash
   # From repo root
   unzip /path/to/paper_objects.zip -d data/local_datasets/
   ```

   Ensure the layout is `data/local_datasets/paper_objects/<paper_id>.json` (or as expected by `NewPaper.from_json` in the code).
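As a quick sanity check on the layout above, a few lines of Python can confirm where a paper object should live and which IDs are present locally. The helpers below (`paper_path`, `list_paper_ids`) are illustrative only, not part of the repo:

```python
from pathlib import Path

# Directory the repo expects for preprocessed paper objects
PAPER_DIR = Path("data/local_datasets/paper_objects")

def paper_path(paper_id: str) -> Path:
    """Return the expected on-disk location for a single paper object."""
    return PAPER_DIR / f"{paper_id}.json"

def list_paper_ids(paper_dir: Path = PAPER_DIR) -> list[str]:
    """List paper IDs available locally (empty if the zip isn't unpacked yet)."""
    if not paper_dir.is_dir():
        return []
    return sorted(p.stem for p in paper_dir.glob("*.json"))
```

If `list_paper_ids()` comes back empty after unzipping, the archive was likely extracted one level too deep or too shallow.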
Copy the example env and add API keys for the LLM providers you use (OpenAI, Anthropic, Gemini, and/or LiteLLM):
```bash
cp example.env .env
# Edit .env and set OPEN_AI_TOKEN, ANTHROPIC_TOKEN, GEMINI_TOKEN, etc.
```

All datasets are expected under `data/local_datasets/`:

- `paper_objects/` — Paper JSONs (from the zip above) that form the inputs for profile generation
- `simulated_profile_inputs/` — A dataset of real Deep Research queries that forms the evaluation inputs
- `qualitative_coding_data/` — Our validation set for LLM simulation reliability, derived from our qualitative interviews
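A missing API key usually surfaces as a confusing provider error mid-run, so it can help to check the environment up front. The sketch below is illustrative (the token names match `example.env`, but this helper is not part of the repo):

```python
import os

# Token variables named in example.env; only the providers you use need to be set.
PROVIDER_TOKENS = ["OPEN_AI_TOKEN", "ANTHROPIC_TOKEN", "GEMINI_TOKEN"]

def missing_tokens(env: dict[str, str]) -> list[str]:
    """Return the provider tokens that are absent or empty in the given environment."""
    return [name for name in PROVIDER_TOKENS if not env.get(name)]

if __name__ == "__main__":
    absent = missing_tokens(dict(os.environ))
    if absent:
        print("Warning: no key set for:", ", ".join(absent))
```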
We provide several scripts in the scripts/ folder for the main workflows:
From the repo root:
```bash
# Edit paths and model in scripts/generate.sh, then:
./scripts/generate.sh
```

This runs:

- Profile generation — `model.user_profile.generate_profile_batch` (reads paper collections from `data_input_dir`, uses `paper_dir` for full text, writes to `profile_output_dir`).
- Plan generation — `model.plan.generate_plan_batch` (reads profiles and inputs, writes to `plan_output_dir`).

Adjust `paper_dir`, `data_input_dir`, `profile_output_dir`, `plan_output_dir`, `model_name`, and `model_type` in the script as needed.
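The key dataflow in the script above is that plan generation consumes the profiles written by the first stage. A small sketch makes the wiring explicit; `GenerateConfig` and `pipeline_stages` are hypothetical names for illustration (the field names mirror the variables in `scripts/generate.sh`):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GenerateConfig:
    """Mirrors the variables edited in scripts/generate.sh."""
    paper_dir: Path
    data_input_dir: Path
    profile_output_dir: Path
    plan_output_dir: Path
    model_name: str
    model_type: str

def pipeline_stages(cfg: GenerateConfig) -> list[tuple[str, Path]]:
    """Order of the two generation stages and where each writes its outputs.

    Plan generation reads the profiles produced by the first stage, so
    profile_output_dir doubles as an input to the second stage.
    """
    return [
        ("model.user_profile.generate_profile_batch", cfg.profile_output_dir),
        ("model.plan.generate_plan_batch", cfg.plan_output_dir),
    ]
```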
```bash
# Edit paths and model in scripts/eval_profile.sh, then:
./scripts/eval_profile.sh
```

Runs `evaluation.eval_profile` and then `metric_scores.profile_metrics` to produce profile evaluation outputs and summary metrics.
```bash
# Edit paths and model in scripts/eval_plan.sh, then:
./scripts/eval_plan.sh
```

Runs `evaluation.eval_plans` and `metric_scores.plan_metrics` for plan evaluation and metrics.
```bash
# Edit paths in scripts/simulation.sh, then:
./scripts/simulation.sh
```

Runs `evaluation.predict_user_data` and `metric_scores.simulation_plot` for the qualitative/simulation experiments.
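The core question these simulation experiments ask is how often an LLM judge's predicted preference matches the real user's. The actual statistics are computed by `metric_scores.simulation_plot`; the function below is only a generic illustration of that kind of reliability measure, not the repo's implementation:

```python
def agreement_rate(predicted: list[str], actual: list[str]) -> float:
    """Fraction of items where the simulated preference matches the real user's.

    A generic agreement measure for illustration; the repo computes its own
    statistics in metric_scores.simulation_plot.
    """
    if len(predicted) != len(actual):
        raise ValueError("prediction/label lists must be the same length")
    if not predicted:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(predicted)
```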
```
├── data/             # Dataset loaders, paper/profile/plan types
├── evaluation/       # Eval runs, rubrics, metrics, predict_user_data
├── llms/             # Model wrappers (OpenAI, Anthropic, Gemini, LiteLLM)
├── metric_scores/    # Profile/plan metrics and simulation plots
├── model/
│   ├── user_profile/ # Constitution (profile) generation
│   └── plan/         # Personalized plan generation
├── prompt/           # Templates and user-prediction prompts
├── scripts/          # Shell scripts for generate / eval / simulation
├── example.env
└── requirements.txt
```
If you use this code or the Personalized ScholarQA evaluation framework in your work, please cite:
```bibtex
@article{balepur2026language,
  title={Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users},
  author={Balepur, Nishant and Hamada, Malachi and Kishore, Varsha and Feldman, Sergey and Singh, Amanpreet and Siangliulue, Pao and Chang, Joseph Chee and Choi, Eunsol and Boyd-Graber, Jordan Lee and Naik, Aakanksha},
  journal={arXiv preprint arXiv:2603.16120},
  year={2026}
}
```

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
This work was done during my wonderful internship at Ai2's Semantic Scholar team!
If you encounter any issues that are easy to fix, we would appreciate it if you made a pull request. For any other concerns or questions about our paper, please reach out to:
- Nishant Balepur, Intern (nishantbalepur@gmail.com)
- Aakanksha Naik, Mentor (aakankshan@allenai.org)
