UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
With the rapid advance of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-tool chains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.
- [2025.12.2] Code and benchmark released.
- [2025.12.2] Paper released on arXiv.
- Release UnicBench evaluation code
- Release benchmark test data
- Release UnicEdit-10M dataset
- Release Qwen-Verify model
- Release data generation pipeline
- UnicEdit-10M: A quality-aware data curation pipeline with unified post-verification and a 10M-scale high-quality image editing dataset with diverse basic and complex editing tasks.
- Qwen-Verify: A 7B dual-task expert model for efficient failure detection and instruction recaptioning.
- UnicBench: A comprehensive benchmark with novel metrics (Non-edit Consistency, Reasoning Accuracy) for fine-grained diagnosis.
UnicBench/
├── assets/                      # Images for README
├── data/
│   ├── prompts.py               # VLM evaluation prompts (IF, NC, VQ, RA)
│   └── test_data.jsonl          # Benchmark test data
├── eval/
│   ├── eval_pipeline.py         # Main evaluation pipeline
│   └── calculate_scores.py      # Score statistics tool
├── inference/
│   ├── gen_samples_flux.py      # Generate samples using FLUX
│   └── gen_samples_flux.sh      # Shell script for inference
└── models/                      # VLM models for evaluation
# Create conda environment
conda create -n unicbench python=3.11
conda activate unicbench
# Install dependencies
pip install -r requirements.txt
You can load the dataset directly from Hugging Face using the datasets library:
from datasets import load_dataset
# Load the dataset
ds = load_dataset("xiaotanhua/UnicBench")
# Access data
print(ds['train'][0])
UnicBench consists of 1,100 samples across 4 task categories and 22 subtasks:
| Task Category | Subtasks | Samples |
|---|---|---|
| Object Editing | 7 subtasks | 350 |
| Attribute Editing | 5 subtasks | 250 |
| Scene Editing | 5 subtasks | 250 |
| Reasoning Editing | 5 subtasks | 250 |
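As a quick consistency check on the table above, the following minimal sketch sums the rows and derives the number of samples per subtask. The figures are copied from the table; the variable and function names are illustrative only.

```python
# Rows copied from the table above: (category, number of subtasks, number of samples).
CATEGORIES = [
    ("Object Editing", 7, 350),
    ("Attribute Editing", 5, 250),
    ("Scene Editing", 5, 250),
    ("Reasoning Editing", 5, 250),
]


def summarize(rows):
    """Return total subtasks, total samples, and samples per subtask by category."""
    total_subtasks = sum(n_sub for _, n_sub, _ in rows)
    total_samples = sum(n for _, _, n in rows)
    per_subtask = {name: n / n_sub for name, n_sub, n in rows}
    return total_subtasks, total_samples, per_subtask


total_subtasks, total_samples, per_subtask = summarize(CATEGORIES)
print(total_subtasks, total_samples)  # 22 1100
print(per_subtask)                    # 50.0 samples per subtask in every category
```

The totals agree with the stated 22 subtasks and 1,100 samples, and the per-category counts imply an even 50 samples per subtask.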
| Metric | Description |
|---|---|
| IF (Instruction Following) | Measures how well the edit follows the given instruction |
| NC (Non-edit Consistency) | Measures consistency of non-edited regions |
| VQ (Visual Quality) | Measures visual quality and naturalness of edited images |
| RA (Reasoning Accuracy) | Measures reasoning accuracy (only for Reasoning Editing tasks) |
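Because RA applies only to Reasoning Editing tasks, any averaging over samples has to skip records that carry no RA score. The sketch below shows one way to do that. The record layout, the field names, and the score values are assumptions made for illustration; the actual aggregation in calculate_scores.py may differ.

```python
from statistics import mean

# Metric abbreviations from the table above.
METRICS = ("IF", "NC", "VQ", "RA")


def aggregate(records):
    """Average each metric over the records that actually report it.

    A record without a given metric (for example RA on a non-reasoning
    task) is ignored for that metric instead of being counted as zero.
    """
    summary = {}
    for metric in METRICS:
        values = [r[metric] for r in records if r.get(metric) is not None]
        summary[metric] = mean(values) if values else None
    return summary


# Hypothetical per-sample results; subtask names and scores are made up.
records = [
    {"subtask": "hypothetical_object_subtask", "IF": 8, "NC": 9, "VQ": 7},
    {"subtask": "hypothetical_reasoning_subtask", "IF": 6, "NC": 8, "VQ": 7, "RA": 5},
]
print(aggregate(records))
```

Returning None for a metric with no values keeps "not applicable" distinct from a genuine score of zero.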
First, generate edited images using your image editing model. The output should be saved following this path format:
{save_dir}/{model_name}/{subtask_name}/{language}/{key}.png
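If you write your own inference script, a small helper can keep the output paths consistent with the format above. This is a minimal sketch: the function name and the example argument values are placeholders, not part of the repository.

```python
from pathlib import Path


def output_path(save_dir, model_name, subtask_name, language, key):
    """Build {save_dir}/{model_name}/{subtask_name}/{language}/{key}.png."""
    return Path(save_dir) / model_name / subtask_name / language / f"{key}.png"


# Placeholder values for illustration only.
path = output_path("results", "your_model_name", "some_subtask", "en", "0001")
print(path.as_posix())

# Before saving an image, create the parent directory, for example:
# path.parent.mkdir(parents=True, exist_ok=True)
```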
We provide reference inference scripts for FLUX.1-Kontext and Qwen-Image-Edit:
bash inference/gen_samples_flux.sh # for FLUX.1-Kontext
bash inference/gen_samples_qwen.sh # for Qwen-Image-Edit
The output directory structure must follow the format below:
{save_dir}/
└── {model_name}/
    ├── {subtask_name}/{language}/                      # Edited images
    └── eval_output/{vlm_name}/
        ├── {subtask_name}_{language}_results.jsonl     # Per-sample results
        └── statistics/
            └── {language}_statistics.json              # Aggregated statistics
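Before running evaluation, it can help to confirm that the edited images landed where the layout above expects them. The sketch below counts PNG files per subtask and language under {save_dir}/{model_name}/. The function name is hypothetical, and the check relies only on the {subtask_name}/{language}/{key}.png convention.

```python
from collections import Counter
from pathlib import Path


def count_edited_images(save_dir, model_name):
    """Count edited images per (subtask_name, language) pair."""
    root = Path(save_dir) / model_name
    counts = Counter()
    if not root.is_dir():
        return counts  # nothing generated yet
    for png in root.glob("*/*/*.png"):
        subtask, language = png.parts[-3], png.parts[-2]
        if subtask == "eval_output":
            continue  # evaluation outputs, not edited images
        counts[(subtask, language)] += 1
    return counts


# Example usage (paths are placeholders):
# for (subtask, language), n in sorted(count_edited_images("/path/to/results", "your_model_name").items()):
#     print(subtask, language, n)
```

Comparing these counts against the number of benchmark samples per subtask is a cheap way to catch missing generations early.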
Use eval_pipeline.py to evaluate edited images and compute final scores. You can load data from a local JSONL file or directly from Hugging Face.
Option 1: Using Hugging Face Dataset (Recommended)
cd eval
python eval_pipeline.py \
--data_path xiaotanhua/UnicBench \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
--languages en \
--num_workers 8
Option 2: Using Local JSONL File
cd eval
python eval_pipeline.py \
--data_path ../data/test_data.jsonl \
--image_dir /path/to/benchmark/images \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
--languages en \
--num_workers 8
Parameters:
| Parameter | Description |
|---|---|
| --data_path | Path to test data jsonl file OR Hugging Face dataset name (e.g., xiaotanhua/UnicBench) |
| --image_dir | Directory containing original benchmark images (Required for JSONL, Optional for HF dataset) |
| --save_dir | Root directory to save results |
| --edit_model_name | Name of your editing model |
| --vlm_model_name | VLM model for evaluation (default: gpt-4.1-2025-04-14) |
| --languages | Languages to evaluate: en, cn, or both |
| --num_workers | Number of parallel workers (for API-based VLMs) |
| --skip_evaluation | Skip evaluation, only compute statistics |
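Since --data_path accepts either a local JSONL file or a Hugging Face dataset name, a loader has to tell the two apart. The sketch below shows one plausible way to do so. It is an assumption for illustration, not the logic used in eval_pipeline.py, and the function names are hypothetical. The "train" split matches the loading example earlier in this README.

```python
import json
from pathlib import Path


def is_local_jsonl(data_path):
    """True if data_path points to an existing local .jsonl file."""
    p = Path(data_path)
    return p.suffix == ".jsonl" and p.is_file()


def load_samples(data_path, split="train"):
    """Load benchmark samples from a local JSONL file or from the Hugging Face Hub."""
    if is_local_jsonl(data_path):
        with open(data_path, encoding="utf-8") as f:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
    # Imported lazily so the local path works without the datasets library.
    from datasets import load_dataset
    return load_dataset(data_path, split=split)


print(is_local_jsonl("xiaotanhua/UnicBench"))
```

With a local file you also need --image_dir, because the JSONL does not carry the original benchmark images.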
If evaluation has already been completed and you only need to aggregate statistics, run calculate_scores.py on the existing evaluation results:
python calculate_scores.py \
--save_dir /path/to/results \
--edit_model_name your_model_name \
--vlm_model_name gpt-4.1 \
--languages en cn
Evaluation results of mainstream image editing models on UnicBench:
@article{ye2025unicedit,
title={UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits},
author={Ye, Keming and Huang, Zhipeng and Fu, Canmiao and Liu, Qingyang and Cai, Jiani and Lv, Zheqi and Li, Chen and Lyu, Jing and Zhao, Zhou and Zhang, Shengyu},
journal={arXiv preprint arXiv:2512.02790},
year={2025}
}
This project is released under the Apache 2.0 License.
We thank all contributors and the open-source community for their support.



