yangalan123/AI-Realtor-Codebase

AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting

This repository contains the codebase for the paper "AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting".

Citation

If you use this code as part of any published research, please acknowledge the following paper:

@article{wu2025grounded,
  title={AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting},
  author={Wu, Jibang and Yang, Chenghao and Wu, Yi and Mahns, Simon and Wang, Chaoqi and Zhu, Hao and Fang, Fei and Xu, Haifeng},
  journal={arXiv preprint arXiv:2502.16810},
  year={2025}
}

Data Release

The datasets used in this research are released on Hugging Face; the listing data and the public user-preference data can be downloaded as described under Setup Instructions below.

Important: Users must agree to the license terms before accessing the datasets.

License and Usage

This project is intended solely for educational and research use, not for commercial purposes.

Privacy Disclaimer

Reasonable efforts have been made to process the data and remove or anonymize Personally Identifiable Information (PII). However, the complete absence of PII cannot be guaranteed. The User agrees to handle the Dataset with care and is solely responsible for:

  • Ensuring their use of the Dataset complies with all applicable privacy laws and regulations (e.g., GDPR, CCPA).
  • Any consequences arising from the use of any PII that may remain within the Dataset.
  • Not attempting to re-identify any individuals from the anonymized data.

Setup Instructions

1. Install Dependencies

pip install -r requirements.txt

requirements.txt includes GPU/local-LLM dependencies such as vllm, which may require a CUDA-enabled Linux environment. For the credential-free artifact smoke test on a CPU-only machine, the following smaller dependency set is sufficient:

pip install datasets pandas numpy matplotlib seaborn scipy

2. Configure API Keys

Only scripts that call hosted LLM APIs require credentials. The smoke test below does not need an OpenAI key. If you have a key, set it as an environment variable rather than editing source files:

export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_ORG_ID="your-openai-org-id"  # optional

On Windows PowerShell:

$env:OPENAI_API_KEY = "your-openai-api-key"
$env:OPENAI_ORG_ID = "your-openai-org-id"  # optional

The Elasticsearch demos read optional connection settings from:

export ELASTICSEARCH_URL="https://localhost:9200/"
export ELASTICSEARCH_USERNAME="elastic"
export ELASTICSEARCH_PASSWORD="your-password"
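Because these settings are optional, a demo can fall back to defaults when they are unset. The sketch below shows one way to collect them; the variable names come from the exports above, while the helper itself and its defaults are illustrative, not code from the repository:

```python
import os

def load_es_settings(env=None):
    """Gather optional Elasticsearch settings with fallbacks.

    The variable names mirror the exports above; the default URL is the one
    shown in the example. Illustrative helper, not part of the repo's demos.
    """
    env = os.environ if env is None else env
    return {
        "url": env.get("ELASTICSEARCH_URL", "https://localhost:9200/"),
        "username": env.get("ELASTICSEARCH_USERNAME"),
        "password": env.get("ELASTICSEARCH_PASSWORD"),
    }
```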

3. Prepare Data Artifacts

Download the listing data with:

python -c "from utils import get_original_all_features_data; get_original_all_features_data()"

This will download the listing data from Hugging Face and save it to ./data/ai_realtor_listing_data.json.
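To confirm the download succeeded, the record count can be checked with a short helper. The path is the one written above; the assumption that the JSON top level is a list of records is mine:

```python
import json
from pathlib import Path

def count_listings(path="data/ai_realtor_listing_data.json"):
    """Return the number of listing records in the downloaded file.

    Assumes the JSON top level is a list of records; the credential-free
    smoke test expects 1,883 of them.
    """
    with Path(path).open(encoding="utf-8") as f:
        return len(json.load(f))
```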

Some research scripts also expect additional governed or derived artifacts that are not generated by the smoke test:

  • data/extracted_features.jsonl: highlight-feature annotations used by get_highlight_data() and highlight-model prompting/evaluation scripts.
  • responses_latest.json: anonymized user preference responses used by user-simulation and hallucination-detection scripts.
  • ratings.pkl: the governed paper Elo artifact. The public smoke test can use ratings.synthetic.pkl instead.

The public Hugging Face user-preference dataset can be used to reconstruct the user-simulation input after accepting its license terms. If a script references one of the filenames above, place the corresponding file at that path or adjust the script argument where one is available.
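Before running a research script, it can help to check which of these artifacts are already in place. The paths are the ones listed above; the helper itself is an illustrative sketch:

```python
from pathlib import Path

# Paths taken from the artifact list above; adjust for your checkout.
EXPECTED_ARTIFACTS = [
    "data/ai_realtor_listing_data.json",
    "data/extracted_features.jsonl",
    "responses_latest.json",
    "ratings.pkl",
]

def missing_artifacts(paths, root="."):
    """Return the subset of expected artifact paths not present under root."""
    base = Path(root)
    return [p for p in paths if not (base / p).exists()]
```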

4. Credential-Free Artifact Smoke Test

The following commands exercise the public, non-API artifact path:

python -c "from utils import get_original_all_features_data; get_original_all_features_data()"
python benchmark/win_rate_plot.py
python benchmark/generate_synthetic_ratings.py --output ratings.synthetic.pkl
python benchmark/elo_plot.py --ratings-pkl ratings.synthetic.pkl

Expected outputs:

  • data/ai_realtor_listing_data.json with 1,883 listing records.
  • comparison_win_rates_improved.pdf from benchmark/win_rate_plot.py.
  • ratings.synthetic.pkl, a non-sensitive ratings file used only for smoke testing.
  • elo_ratings_grouped.pdf from benchmark/elo_plot.py.

The original ratings.pkl used for the paper's Elo visualization is not included in the public repository because it is derived from privacy/ethics-sensitive evaluation artifacts. If you have governed access to that file, place it at ratings.pkl or pass its path with python benchmark/elo_plot.py --ratings-pkl path/to/ratings.pkl.
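Either ratings file can be inspected directly before plotting. The plain-pickle format is an assumption based on the .pkl extension; the internal structure of the ratings object is not documented here:

```python
import pickle
from pathlib import Path

def load_ratings(path="ratings.synthetic.pkl"):
    """Load a ratings artifact from disk.

    Plain pickle is assumed from the .pkl extension; the structure of the
    loaded object is whatever benchmark/generate_synthetic_ratings.py wrote.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)
```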

5. Optional OpenAI-Backed Checks

Reviewers with an OpenAI key can additionally verify hosted-model paths. These commands make API calls and may incur cost, latency, and rate-limit delays. They should be treated as optional reproduction checks, not as part of the default smoke test.

First confirm the key is visible:

python -c "import os; assert os.environ.get('OPENAI_API_KEY'), 'OPENAI_API_KEY is not set'"

A small end-to-end API sanity check is:

python rag_agents/preference_summary_from_ranking_demo.py

Expected outputs:

  • Printed preference summaries for the four built-in example users.
  • preference_analysis_responses_binary_feedback.pkl.
  • preference_analysis_responses_binary_feedback.csv.

To reproduce the OpenAI-based highlight prompting baseline, prepare data/ai_realtor_listing_data.json and data/extracted_features.jsonl, then run:

python highlight_model/prompt_baseline_gpt4.py --model gpt-4o

Expected outputs are checkpoint files under prompting_baseline_outputs/gpt-4o/, named like highlight_model_prompting_gpt4_output_0.pt. The script processes 10 batches and skips already-existing batch outputs, so it can be resumed.
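The resume behavior described above (skip any batch whose checkpoint already exists) follows a common pattern that can be sketched as follows. The filename template is taken from the example checkpoint name; the helper is illustrative, not the script's actual code:

```python
from pathlib import Path

def pending_batches(out_dir, n_batches=10,
                    template="highlight_model_prompting_gpt4_output_{}.pt"):
    """Return batch indices whose checkpoint file does not yet exist.

    Mirrors the skip-if-present resume behavior described above; the
    template follows the example checkpoint filename.
    """
    out = Path(out_dir)
    return [i for i in range(n_batches)
            if not (out / template.format(i)).exists()]
```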

To reproduce the OpenAI batch user-simulation path, prepare responses_latest.json, then run:

python user_simulation/predicting_preference_batch_api.py \
  --data responses_latest.json \
  --model_name gpt-4o-mini \
  --exp_name naive_few_shot \
  --eval_mode online

Expected behavior:

  • The first run creates batch input files, submits OpenAI Batch API jobs, and writes checkpoints under responses_latest_batch_api/.
  • Later runs poll existing batch jobs and download completed results.
  • Once all batches complete, the script saves batch_scores.pt..., batch_accuracy.pt..., accuracy_histogram.pdf, and shotwise_accuracy.pdf.
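The submit-then-poll lifecycle above can be sketched generically. The loop below takes a caller-supplied status callable instead of calling the OpenAI Batch API directly, so the terminal-state names and polling parameters here are illustrative assumptions, not the script's actual logic:

```python
import time

def poll_until_done(get_status, batch_ids, interval_s=30, max_polls=100):
    """Poll each batch until all report a terminal state.

    get_status(batch_id) -> status string is supplied by the caller (the
    real script queries the OpenAI Batch API). Returns True if everything
    finished within max_polls rounds, False otherwise.
    """
    pending = set(batch_ids)
    for _ in range(max_polls):
        pending = {b for b in pending
                   if get_status(b) not in ("completed", "failed", "expired")}
        if not pending:
            return True
        time.sleep(interval_s)
    return False
```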

Project Structure

├── benchmark/                    # Evaluation and benchmarking scripts
├── hallucination_detection/      # Hallucination detection and evaluation
├── highlight_model/              # Highlight model training and inference
├── rag_agents/                   # RAG-based agent implementations
├── user_simulation/              # User preference simulation and prediction
├── const.py                      # Constants and feature mappings
├── utils.py                      # Utility functions
└── requirements.txt              # Python dependencies

Key Components

Feature Processing

  • const.py: Contains the desired feature names and mappings from original features to standardized ones
  • utils.py: Utility functions for data processing, feature normalization, and data loading

User Simulation

  • user_simulation/: Contains scripts for predicting user preferences and simulating user behavior

Highlight Model

  • highlight_model/: Training and inference scripts for the highlight model that identifies important features

RAG Agents

  • rag_agents/: Retrieval-Augmented Generation agents for generating persuasive real estate descriptions

Evaluation

  • benchmark/: Scripts for evaluating model performance using ELO ratings and win rates
  • hallucination_detection/: Tools for detecting and evaluating hallucination in generated content

Usage Examples

Loading Data

from utils import get_original_all_features_data, get_highlight_data

# Load the listing data; run this first to download the artifacts the rest of the project expects.
all_features = get_original_all_features_data()

# get_highlight_data() additionally requires data/extracted_features.jsonl (see Prepare Data Artifacts above).

Running Visualization

# Generate the win-rate plot
python benchmark/win_rate_plot.py

# Generate the Elo plot from a governed ratings artifact
python benchmark/elo_plot.py --ratings-pkl ratings.pkl

# Generate the Elo plot from a non-sensitive synthetic ratings artifact
python benchmark/generate_synthetic_ratings.py --output ratings.synthetic.pkl
python benchmark/elo_plot.py --ratings-pkl ratings.synthetic.pkl

Requirements

The main dependencies include:

  • PyTorch
  • Transformers
  • OpenAI
  • Datasets
  • Pandas
  • NumPy
  • Matplotlib
  • And others; see requirements.txt for the complete list

Contributing

This codebase is for research purposes. If you find issues or have suggestions, please open an issue or contact the authors.

Contact

For questions about this research, please refer to the paper or contact the authors directly.
