Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory
RHELM is a comprehensive benchmark for evaluating long-horizon memory capabilities in AI systems. Unlike existing benchmarks that focus on static dialogues, RHELM introduces realistic, heterogeneous, and evolving memory challenges that better reflect real-world assistant scenarios.
- 🎭 Realistic Profiles: Diverse characters with rich backstories, preferences, and evolving life circumstances
- 📊 Heterogeneous Data: Multi-modal external memory sources including conversations, emails, and documents
- 🔄 Temporal Evolution: Time-aware questions that test memory across different temporal contexts
- 🧠 Challenging Question Taxonomy: 7 major categories with 26 complex characteristics requiring multi-hop reasoning, temporal synthesis, preference tracking, and hallucination detection
- ⚠️ Memory-Conditioned Misleading Queries: "Trap" queries that conflict with the user's updated life state, requiring the assistant to detect the implicit conflict, decline the unsafe request, and propose a constraint-compliant alternative
RHELM features a comprehensive taxonomy of challenging memory questions across three major QA domains with 7 categories and 26 complex characteristics.
👉 View Full Challenge Taxonomy
Each QA file is in JSONL format, with one JSON object per line. An example record, pretty-printed here for readability:
{
  "id": "fact_19130b",
  "question": "Reflecting on the morning when my routine felt particularly unsettled and I ended up with a less-than-ideal start, what did I actually have for my first meal of the day?",
  "answer": "Leftover lentil soup",
  "question_date": "2024-10-28",
  "question_type": "fact",
  "supporting_evidence": ["2024-05-26:5"],
  "characteristics": ["State-Dependent Attribute"]
}

| Field | Type | Description |
|---|---|---|
| id | string | Unique question identifier, prefixed by its question type (e.g. fact_19130b). |
| question | string | The user query posed to the memory system. |
| answer | string | The ground-truth answer used for evaluation. |
| question_date | string (YYYY-MM-DD) | The date on which the question is asked. To make full use of the benchmark's complexity, it is recommended to use all historical evidence. |
| question_type | string | One of: fact, temporal, hallucination, aggregation, misleading, attachment, mixed. |
| supporting_evidence | list[string] | References to source items that ground the answer. Conversation evidence uses the form "<session-date>:<turn-index>" (e.g. "2024-05-26:5" = turn 5 of the 2024-05-26 session); attachment evidence references the file/section (e.g. "56_report_task_*.md:Section"). |
| characteristics | list[string] | Fine-grained challenge labels for the question (e.g. State-Dependent Attribute, Multi-Hop Traversal). See the Challenge Taxonomy. |
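The record layout above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the helper names `iter_qa` and `parse_evidence` are hypothetical, the file is assumed to hold one JSON object per line (as JSONL implies), and the rule used to tell conversation evidence from attachment evidence (an integer after the last colon) is an assumption based only on the two reference forms described in the table.

```python
import json
from typing import Iterable, Iterator


def iter_qa(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one QA record per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def parse_evidence(ref: str) -> dict:
    """Split an evidence reference on its last colon.

    Assumption: an integer suffix means conversation evidence
    ("<session-date>:<turn-index>"); anything else is treated as
    attachment evidence ("<file>:<section>").
    """
    source, _, locator = ref.rpartition(":")
    if locator.isdigit():
        return {
            "kind": "conversation",
            "session_date": source,
            "turn_index": int(locator),
        }
    return {"kind": "attachment", "file": source, "section": locator}


# Demo on the example record shown above, written as a single JSONL line.
sample_line = (
    '{"id": "fact_19130b", "question_type": "fact", '
    '"supporting_evidence": ["2024-05-26:5"]}'
)
for record in iter_qa([sample_line]):
    for ref in record["supporting_evidence"]:
        print(record["id"], parse_evidence(ref))
```

Because `iter_qa` accepts any iterable of strings, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as f: records = list(iter_qa(f))`.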
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

The evaluation reads its dataset from the data/ directory (conversations, emails, attachments) and a QA file in JSONL format. Provide the QA file via --input-file:
# Basic RAG evaluation (dense retrieval, top-k=5)
python -m evaluation.rag_benchmark \
--character "David_R._Ellis" \
--input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl"
# Full-context evaluation (no retrieval, feed all evidence to the model)
python -m evaluation.rag_benchmark \
--character "David_R._Ellis" \
--input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl" \
--full-context
# RAG evaluation including emails and attachments, with hybrid (BM25 + dense) retrieval
python -m evaluation.rag_benchmark \
--character "David_R._Ellis" \
--input-file "data/QA_final/low_score_qa_David_R._Ellis_all_validated.jsonl" \
--include-attachment \
--hybrid \
--k 10

LLM credentials are read from environment variables (never hard-coded):
# OpenAI
export OPENAI_API_KEY="sk-..."
# or Azure OpenAI
export AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="..."

Dataset locations, embedding model, chunking and output paths can be customised in evaluation/configs/config.py.
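Before launching a long evaluation run, it can help to confirm that one set of the credential variables listed above is actually present. The sketch below is illustrative only: `detect_llm_provider` is a hypothetical helper, not part of the evaluation package, and the choice to prefer Azure when both sets are configured is an assumption, since the selection logic used by the repository is not described here.

```python
import os
from typing import Mapping, Optional


def detect_llm_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return "azure", "openai", or None depending on which variables are set.

    Assumption: Azure takes precedence when both sets are present, and
    Azure needs both its endpoint and its key to count as configured.
    """
    if env.get("AZURE_OPENAI_ENDPOINT") and env.get("AZURE_OPENAI_API_KEY"):
        return "azure"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


provider = detect_llm_provider()
if provider is None:
    print("No LLM credentials found in the environment.")
else:
    print(f"Using credentials for: {provider}")
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary, without modifying the real process environment.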
| Component | Status |
|---|---|
| Evaluation Framework | ✅ Available |
| Benchmark Data | 🤗 HuggingFace |
| Data Generation Code | 🔜 To be released |
Note: The data generation pipeline will be released upon paper acceptance.
