- [2026.01.16] 🔥🔥🔥 We have released CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation. Check out our 📄 Paper · 🌐 Website.
We are actively preparing to release the following:
- Paper & project page
- Training & inference code
- CoF-T2I model checkpoints & evaluation scripts
- CoF-Evol-Instruct dataset
CoF-T2I brings Chain-of-Frame (CoF) reasoning from video generation into text-to-image generation via progressive visual refinement: intermediate frames serve as explicit reasoning steps, and the final frame is taken as the output image.
Visualization of reasoning trajectories generated by CoF-T2I. The final output is shown at a larger size, with the intermediate frames shown at a smaller size.
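The Chain-of-Frame idea above can be sketched in a few lines. This is a minimal illustration, not the released CoF-T2I API: `generate_video` and `toy_generate_video` are hypothetical stand-ins for the video backbone, and the frame count and resolution are arbitrary. The only point it demonstrates is the paradigm itself: the frame sequence is the reasoning trajectory, and the last frame is taken as the T2I output.

```python
import numpy as np

def cof_t2i_inference(generate_video, prompt, num_frames=8):
    """Chain-of-Frame inference sketch: run a video generator and treat
    the frame sequence as a visual reasoning trajectory.

    `generate_video` is a hypothetical callable returning an array of
    shape (num_frames, H, W, 3); the real CoF-T2I interface may differ.
    """
    frames = generate_video(prompt, num_frames)
    reasoning_steps = frames[:-1]  # intermediate frames = reasoning steps
    final_image = frames[-1]       # final frame = the output image
    return final_image, reasoning_steps

# Toy stand-in generator: frames that progressively "refine" from noise
# toward a target image via linear interpolation.
def toy_generate_video(prompt, num_frames):
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    target = rng.random((64, 64, 3))
    noise = rng.random((64, 64, 3))
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None, None, None]
    return (1 - alphas) * noise + alphas * target

image, steps = cof_t2i_inference(toy_generate_video, "a red cube on a blue sphere")
print(image.shape, len(steps))  # (64, 64, 3) 7
```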
- 🔭 A novel generation paradigm: We propose CoF-T2I, a text-to-image model that repurposes a video foundation model as a pure visual reasoner, generating images through a CoF reasoning process.
- 📖 A comprehensive dataset with a scalable pipeline: We introduce CoF-Evol-Instruct, a 64K-scale dataset of progressive visual refinement trajectories, built with a scalable, quality-aware pipeline.
- 📊 Competitive results with extensive validation: Extensive experiments show that CoF-T2I substantially outperforms its video backbone and achieves competitive performance on challenging benchmarks, with additional analyses further validating the approach.
Overview of CoF-T2I. CoF-T2I builds on a video generation backbone, reframing inference-time reasoning for T2I generation as a CoF refinement process.
We design a quality-aware construction pipeline and curate 64K reasoning trajectories, ensuring both sample-level diversity and frame-wise consistency.
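One component of such a pipeline, the frame-wise consistency check, can be illustrated with a toy filter. This is only an assumed sketch: the paper's actual criterion and threshold are not specified here, and `max_jump` (mean absolute pixel difference between consecutive frames, pixels in [0, 1]) is a hypothetical parameter chosen for illustration.

```python
import numpy as np

def framewise_consistency(frames, max_jump=0.25):
    """Illustrative consistency filter (hypothetical criterion): reject a
    trajectory if any consecutive pair of frames changes too abruptly,
    measured by mean absolute pixel difference."""
    diffs = [float(np.abs(b - a).mean()) for a, b in zip(frames, frames[1:])]
    return all(d <= max_jump for d in diffs), diffs

# A smooth trajectory (gradual refinement) passes; an abrupt one is rejected.
smooth = [alpha * np.ones((4, 4)) for alpha in np.linspace(0.0, 1.0, 5)]
abrupt = [np.zeros((4, 4)), np.ones((4, 4))]
print(framewise_consistency(smooth)[0], framewise_consistency(abrupt)[0])  # True False
```

A real pipeline would apply such per-trajectory checks alongside sample-level deduplication or clustering to preserve the diversity the dataset description mentions.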
We are preparing the release of training, inference, and evaluation code.
Unfold to see our key results and more visualizations
The best and the second best Overall scores are in bold and underlined, respectively.
| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attr. | Overall ↑ |
|---|---|---|---|---|---|---|---|
| Standard Image Models | |||||||
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev | 0.99 | 0.88 | 0.61 | 0.87 | 0.35 | 0.55 | 0.67 |
| Unified MLLMs | |||||||
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| BLIP3-o 8B | -- | -- | -- | -- | -- | -- | 0.84 |
| OmniGen2 | 0.99 | 0.92 | 0.77 | 0.90 | 0.82 | 0.70 | 0.80 |
| BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.78 |
| BAGEL-Think | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 |
| T2I-R1 | 0.99 | 0.91 | 0.53 | 0.91 | 0.76 | 0.65 | 0.79 |
| Video Models | |||||||
| Wan2.1-T2V-14B | 0.92 | 0.63 | 0.57 | 0.69 | 0.18 | 0.31 | 0.55 |
| CoF-T2I (Ours) | 0.98 | 0.95 | 0.83 | 0.89 | 0.83 | 0.71 | 0.86 |
The best and the second best scores are in bold and underlined, respectively.
| Model | Attribute shift | Hybridization | Multi-Object | Spatiotemporal | Overall ↑ |
|---|---|---|---|---|---|
| Standard Image Models | |||||
| SDXL | 4.420 | 4.930 | 4.500 | 6.320 | 4.970 |
| SD3-Medium | 5.140 | 6.300 | 6.070 | 5.910 | 5.780 |
| FLUX.1-dev | 5.680 | 6.380 | 5.240 | 7.130 | 6.060 |
| Unified MLLMs | |||||
| Janus-Pro-7B | 5.300 | 6.730 | 6.040 | 7.280 | 6.220 |
| BLIP3-o 8B | 5.800 | 7.060 | 6.440 | 7.080 | 6.510 |
| OmniGen2 | 5.280 | 6.290 | 6.310 | 7.450 | 6.220 |
| BAGEL | 5.370 | 6.500 | 6.410 | 6.930 | 6.200 |
| BAGEL-Think | 6.260 | 7.740 | 6.960 | 7.130 | 6.930 |
| T2I-R1 | 5.850 | 7.360 | 6.680 | 7.700 | 6.780 |
| Video Models | |||||
| Wan2.1-T2V-14B | 5.436 | 6.950 | 5.383 | 6.237 | 5.939 |
| CoF-T2I (Ours) | 6.969 | 8.070 | 7.797 | 7.287 | 7.468 |
Visualization of CoF-Evol-Instruct Dataset. We showcase the prompt and corresponding CoF trajectories in our data.
Comparison of Wan2.1-T2V (baseline), BAGEL-Think, and CoF-T2I. CoF-T2I produces results with both high photorealistic quality and precise alignment with the prompt.
Complete reasoning trajectories of CoF-T2I, including the intermediate frames and the final output alongside their corresponding prompts.
Performance trend across reasoning steps on GenEval (left) and Imagine-Bench (right).
We would like to thank the following open-source projects and research works:
If you find our work useful, please consider citing:
@misc{tong2026coft2ivideomodelspure,
title={CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation},
author={Chengzhuo Tong and Mingkun Chang and Shenglong Zhang and Yuran Wang and Cheng Liang and Zhizheng Zhao and Ruichuan An and Bohan Zeng and Yang Shi and Yifan Dai and Ziming Zhao and Guanbin Li and Pengfei Wan and Yuanxing Zhang and Wentao Zhang},
year={2026},
eprint={2601.10061},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.10061},
}

This repository is released under the MIT license. See LICENSE for additional details.







