microsoft/RHELM

RHELM: Beyond Static Dialogues

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory

📖 Overview

RHELM is a comprehensive benchmark for evaluating long-horizon memory capabilities in AI systems. Unlike existing benchmarks that focus on static dialogues, RHELM introduces realistic, heterogeneous, and evolving memory challenges that better reflect real-world assistant scenarios.

Key Features

  • 🎭 Realistic Profiles: Diverse characters with rich backstories, preferences, and evolving life circumstances
  • 📊 Heterogeneous Data: Multi-modal external memory sources, including conversations, emails, and documents
  • 🔄 Temporal Evolution: Time-aware questions that test memory across different temporal contexts
  • 🧠 Challenging Question Taxonomy: 7 major categories with 26 complex characteristics requiring multi-hop reasoning, temporal synthesis, preference tracking, and hallucination detection
  • ⚠️ Memory-Conditioned Misleading Queries: "Trap" queries that conflict with the user's updated life state, requiring the assistant to detect the implicit conflict, decline the unsafe request, and propose a constraint-compliant alternative

📋 Challenge Taxonomy

RHELM features a comprehensive taxonomy of challenging memory questions across three major QA domains with 7 categories and 26 complex characteristics.

👉 View Full Challenge Taxonomy

🗂️ QA Format

Each QA file is in JSONL format, with one JSON object per line. A single record, pretty-printed here for readability:

{
  "id": "fact_19130b",
  "question": "Reflecting on the morning when my routine felt particularly unsettled and I ended up with a less-than-ideal start, what did I actually have for my first meal of the day?",
  "answer": "Leftover lentil soup",
  "question_date": "2024-10-28",
  "question_type": "fact",
  "supporting_evidence": ["2024-05-26:5"],
  "characteristics": ["State-Dependent Attribute"]
}
| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique question identifier, prefixed by its question type (e.g. fact_19130b). |
| question | string | The user query posed to the memory system. |
| answer | string | The ground-truth answer used for evaluation. |
| question_date | string (YYYY-MM-DD) | The date on which the question is asked. To exercise the full complexity of the benchmark, it is recommended to use all history evidence. |
| question_type | string | One of: fact, temporal, hallucination, aggregation, misleading, attachment, mixed. |
| supporting_evidence | list[string] | References to source items that ground the answer. Conversation evidence uses the form "<session-date>:<turn-index>" (e.g. "2024-05-26:5" = turn 5 of the 2024-05-26 session); attachment evidence references the file and section (e.g. "56_report_task_*.md:Section"). |
| characteristics | list[string] | Fine-grained challenge labels for the question (e.g. State-Dependent Attribute, Multi-Hop Traversal). See the Challenge Taxonomy. |
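
As an illustration of the format above, here is a minimal Python sketch that reads QA records line by line and splits `supporting_evidence` references into their two documented forms. The names `iter_qa`, `parse_evidence`, `ConversationRef`, and `AttachmentRef` are hypothetical helpers, not part of the RHELM codebase. The sketch also assumes that an attachment reference separates file and section at the first colon, which the README does not state explicitly.

```python
import json
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Union

# Allowed values of question_type, as listed in the field table.
QUESTION_TYPES = {
    "fact", "temporal", "hallucination", "aggregation",
    "misleading", "attachment", "mixed",
}

# Conversation evidence: "<session-date>:<turn-index>", e.g. "2024-05-26:5".
_CONVERSATION_REF = re.compile(r"^(\d{4}-\d{2}-\d{2}):(\d+)$")


@dataclass(frozen=True)
class ConversationRef:  # hypothetical helper type
    session_date: str
    turn_index: int


@dataclass(frozen=True)
class AttachmentRef:  # hypothetical helper type
    file: str
    section: str


def parse_evidence(ref: str) -> Union[ConversationRef, AttachmentRef]:
    """Classify one supporting_evidence string by its documented shape."""
    match = _CONVERSATION_REF.match(ref)
    if match:
        return ConversationRef(match.group(1), int(match.group(2)))
    # Assumption: anything else is "<file>:<section>", split at the first colon.
    file, _, section = ref.partition(":")
    return AttachmentRef(file, section)


def iter_qa(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, checking question_type."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["question_type"] not in QUESTION_TYPES:
            raise ValueError(
                f"line {line_no}: unknown question_type {record['question_type']!r}"
            )
        yield record


# In-memory sample built from the example record shown earlier (question shortened).
sample = [json.dumps({
    "id": "fact_19130b",
    "question": "What did I have for my first meal of the day?",
    "answer": "Leftover lentil soup",
    "question_date": "2024-10-28",
    "question_type": "fact",
    "supporting_evidence": ["2024-05-26:5"],
    "characteristics": ["State-Dependent Attribute"],
})]

for record in iter_qa(sample):
    refs = [parse_evidence(r) for r in record["supporting_evidence"]]
    print(record["id"], refs)
```

To read a real QA file, pass an open file handle instead of the in-memory list, for example `iter_qa(open(path, encoding="utf-8"))`.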

🚀 Quick Start

Installation

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running Evaluation

The evaluation reads its source data (conversations, emails, attachments) from the data/ directory and its questions from a QA file in JSONL format. Provide the QA file via --input-file:

# Basic RAG evaluation (dense retrieval, top-k=5)
python -m evaluation.rag_benchmark \
    --character "David_R._Ellis" \
    --input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl"

# Full-context evaluation (no retrieval, feed all evidence to the model)
python -m evaluation.rag_benchmark \
    --character "David_R._Ellis" \
    --input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl" \
    --full-context

# RAG evaluation including emails and attachments, with hybrid (BM25 + dense) retrieval
python -m evaluation.rag_benchmark \
    --character "David_R._Ellis" \
    --input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl" \
    --include-attachment \
    --hybrid \
    --k 10
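
When running several configurations in a row, it can help to assemble the command line programmatically. The following Python sketch builds argument lists using only the flags shown in the commands above. `build_eval_command` is a hypothetical helper, not part of the repository, and the sketch does not check whether particular flag combinations (for example `--full-context` together with `--hybrid`) are meaningful to the evaluation script.

```python
import shlex
import sys
from typing import List, Optional


def build_eval_command(
    character: str,
    input_file: str,
    *,
    full_context: bool = False,
    include_attachment: bool = False,
    hybrid: bool = False,
    k: Optional[int] = None,
) -> List[str]:
    """Return an argv list for `python -m evaluation.rag_benchmark`."""
    cmd = [
        sys.executable, "-m", "evaluation.rag_benchmark",
        "--character", character,
        "--input-file", input_file,
    ]
    if full_context:
        cmd.append("--full-context")
    if include_attachment:
        cmd.append("--include-attachment")
    if hybrid:
        cmd.append("--hybrid")
    if k is not None:
        cmd += ["--k", str(k)]
    return cmd


qa_file = "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl"
configs = [
    {},                                                     # basic RAG
    {"full_context": True},                                 # full-context
    {"include_attachment": True, "hybrid": True, "k": 10},  # hybrid retrieval
]
for options in configs:
    # Print each command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(build_eval_command("David_R._Ellis", qa_file, **options)))
```

Building the command as a list rather than a single string avoids shell quoting problems with character names or paths.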

Configuration

LLM credentials are read from environment variables (never hard-coded):

# OpenAI
export OPENAI_API_KEY="sk-..."

# or Azure OpenAI
export AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="..."
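
Before starting a long evaluation run, a quick pre-flight check can confirm that one of the credential sets above is present. The Python sketch below is illustrative only: `detect_provider` is a hypothetical helper, and checking Azure first is an assumption of this sketch, not documented behaviour of the evaluation code.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return "azure", "openai", or None depending on which variables are set."""
    if env is None:
        env = os.environ
    # Azure OpenAI needs both the endpoint and the key.
    if env.get("AZURE_OPENAI_ENDPOINT") and env.get("AZURE_OPENAI_API_KEY"):
        return "azure"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


provider = detect_provider()
if provider is None:
    print("No LLM credentials found; set OPENAI_API_KEY or the AZURE_OPENAI_* variables.")
else:
    print(f"Using credentials for: {provider}")
```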

Dataset locations, embedding model, chunking and output paths can be customised in evaluation/configs/config.py.

📦 Data & Code Release

| Component | Status |
| --- | --- |
| Evaluation Framework | ✅ Available |
| Benchmark Data | 🤗 HuggingFace |
| Data Generation Code | 🔜 To be released |

Note: The data generation pipeline will be released upon paper acceptance.
