shallweiwei/GPT-4o-Image-Generation-for-OCR

 
 


Awesome Generative Models for OCR🚀

This repository evaluates the generation and editing capabilities of state-of-the-art image generators, both closed-source and open-source, on various Optical Character Recognition (OCR) tasks. Currently, we have tested GPT-4o, Qwen-VLo, Flux.1-Kontext-dev, and Janus-4o. The evaluation includes generating multiple types of text images (handwritten notes, printed documents, posters, street signs, historical manuscripts, etc.) and editing specific content in text images. We aim to understand the boundaries of SOTA image generation models applied to the specialized field of OCR, identify remaining challenges, and explore how close we are to achieving AGI-level capabilities in this domain.

This repository was formerly known as GPT-4o-Image-Generation-for-OCR, and included only the evaluation of image generation capabilities of GPT-4o. Now we are expanding our evaluation to more models, especially open-source models.

We welcome 🌟issues, PRs, and stars🌟, and invite you to join us for more comprehensive testing and evaluation!

📃News

📌Pinned

  • 🎉 [July 2025] Our paper is online at arXiv! Citations and stars are welcome if you find our work useful! 😊

  • 🔥 [June 2025] Expanded evaluation now includes various closed-source and open-source models!

  • 📢 [March 2025] Initial evaluation of GPT-4o's image generation capabilities now available!

💎Observations

GPT-4o

Tasks with Good Performance (few or no errors):

  • Text-to-Image (T2I) Generation (Handwritten text, scene text, slides or other creative graphics, ancient text, overlapping text and images)
  • Text Super-Resolution, Text Style Transfer, Scene Text Removal

Tasks with Marginal Performance (sometimes works, sometimes doesn't):

  • Handwritten Text Removal, Layout-Aware Text Generation

Tasks Currently Unachievable:

  • Document Dewarping, Document Shadow Removal, Document Deblurring, Document Appearance Enhancement
  • Historical Document Restoration, Historical Document Style Transfer
  • Ordered Text Generation (Generating text like 0, 1, 2, ...)

Technical Characteristics:

  1. GPT-4o excels at generating creative and design-oriented images with text, such as slides and street scenes, when given detailed prompts.
  2. GPT-4o generates images with dimensions that are multiples of 512 pixels. Therefore, in tasks requiring image inputs (text editing, document dewarping, etc.), it mostly fails to maintain the original image's aspect ratio and incorrectly outputs images as square.
Detailed observations of GPT-4o's evaluation:
  1. Excellent at generating English text, but the accuracy of Chinese character generation is low. Only larger Chinese characters are generated accurately; smaller Chinese characters are almost completely incorrect.

  2. Can generate simplified Chinese characters but cannot generate complex Chinese characters.

  3. When performing image editing, the unedited parts of the image cannot be accurately replicated and are often accompanied by cropping, expansion, sharpening, detail changes, etc.

  4. In tasks involving image input, if the image contains dense text, the text in the output image is likely to be severely garbled (e.g., document rectification, document shadow removal, historical document restoration, historical document style transfer).

  5. In tasks involving image input, if the image itself contains embedded graphics, the embedded graphics cannot be restored in the output (e.g., document rectification).

  6. Most likely does not use OCR to recognize text and then re-render it.
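The aspect-ratio observation above (outputs sized in multiples of 512 pixels and often forced to square) can be checked mechanically when comparing an input image with a model output. The following is a minimal sketch, not the repository's own tooling: the function names, the 2% tolerance, and the example sizes are assumptions made for illustration, and sizes are passed as plain `(width, height)` tuples.

```python
from math import isclose


def is_multiple_of(size, base=512):
    """Return True if both width and height are exact multiples of `base`."""
    width, height = size
    return width % base == 0 and height % base == 0


def aspect_ratio_preserved(input_size, output_size, rel_tol=0.02):
    """Return True if the output keeps the input's width/height ratio.

    `rel_tol` is an assumed tolerance (2%) to absorb rounding to whole pixels.
    """
    in_w, in_h = input_size
    out_w, out_h = output_size
    return isclose(in_w / in_h, out_w / out_h, rel_tol=rel_tol)


def describe_output(input_size, output_size):
    """Summarize how an output size relates to the input size."""
    out_w, out_h = output_size
    return {
        "multiple_of_512": is_multiple_of(output_size),
        "square": out_w == out_h,
        "aspect_ratio_preserved": aspect_ratio_preserved(input_size, output_size),
    }


if __name__ == "__main__":
    # Hypothetical sizes: a portrait document page and a square model output.
    print(describe_output((1240, 1754), (1024, 1024)))
```

Running such a check over every input/output pair would make the "image is squared" assessments in the tables below reproducible instead of relying on visual inspection.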

Qwen-VLo

Technical Characteristics:

  1. The model relies too heavily on the conversation history (e.g., its weight during generation), which sometimes leads to poor instruction following.

  2. It cannot reliably infer whether the user wants an image or a textual response. For example, when prompted to “remove all handwritten text in this image” (left), it provides a step-by-step textual explanation rather than producing the edited image. Only when explicitly instructed to “output the resulted image” (right) does the model generate the visual result the user actually needs.

  3. It fails to render large amounts of text, whether English or Chinese, with few successful cases.
  4. Its instruction-following ability is poor. For instance, the model outputs square images when instructed to output rectangular ones, and it outputs a book page when asked to generate a slide.

Flux.1-Kontext-dev

Technical Characteristics:

  1. The model can partially handle English image generation and editing, but it fails at Chinese image generation.
  2. It mostly fails to maintain the original image's aspect ratio and incorrectly outputs square images. We did not find any parameter for preserving the original size during generation. However, on the official Flux.AI website, the user can select “match input” as the output image’s dimension. We are looking into this.

Janus-4o

Technical Characteristics:

Janus-4o has almost no text rendering ability for either English or Chinese text, potentially due to its small model size (7B).
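The observations above are qualitative judgments ("few or no errors", "almost completely incorrect"). One possible way to put a number on text-rendering accuracy, which this repository does not claim to use, is the character error rate (CER) between the text requested in the prompt and the text read back from the generated image. The sketch below assumes the read-back string comes from a human transcription or an OCR engine of your choice; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical read-back strings for two prompts used in this README.
    print(character_error_rate("Tomorrow", "Tomorow"))
    print(character_error_rate("超级市场", "超级市场"))
```

Because the distance is computed per character, the same function applies to English and Chinese text without tokenization, which suits the mixed-language prompts evaluated here.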

📖Content

📄Modern Document Image

Document Dewarping

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Please perform dewarping on this document to make it flat and clear. EN


Texts are chaotic and blurred. Three columns become two columns.

🤔
Flat document but totally blurred text.


Totally failed and blurred text.


The image has been enhanced and text is completely unreadable.


Not dewarped.


Totally failed and blurred text.

Please perform dewarping on this document to make it flat and clear. EN


The embedded drawing is not correctly restored. Some text is blurred.


Not dewarped. Large texts are clear but small ones are blurred.


Not dewarped. Large texts are clear but small ones are blurred.


The image has been enhanced. Large texts are clear but small ones are blurred.


Not dewarped. Large texts are clear but small ones are blurred.


Totally failed and chaotic, blurred text.

请帮我把这张图片中的文档矫正成一张平铺、清晰的文档 ZH


Only the large text is good. Small text is incompletely restored and blurred.

🤔
Flat document but totally incorrect text.


Totally failed and blurred text.


Not dewarped. Large texts are clear but small ones are blurred.


Totally failed and text is completely unreadable.

裁剪出演唱会的票 ZH

🤔
Direction is correct. The Chinese text looks plausible but is meaningless.


Totally wrong.


Not cropped.


The image has been enhanced. Large texts are clear but small ones are blurred.


Totally failed and text is completely unreadable.

裁剪出票据 ZH

🤔
Only the large text is good. Small text is blurred or lacks semantic meaning.

🤔
Only the large text is good. Small text is blurred or lacks semantic meaning.


Not cropped.


All content has been erased.


The image has been enhanced. Large texts are clear but small ones are blurred.


Totally failed and text is completely unreadable.

Document Deshadowing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

请帮我去掉这张文档图片中的阴影 ZH

🤔
Shadows are removed. But the image is over-rectified.

🤔
Shadows are removed, but the color is changed and the text is blurred.


Shadows are removed.


Totally failed and wrong color.

🤔
Partially good. Some shadows are removed.


Totally failed and wrong color.

Process this document image to eliminate shadow artifacts and produce a clean, evenly lit version. EN

🤔
Partially good. Shadows are removed. But texts are wrong.

🤔
Shadows are removed. Text is blurred.


Shadows are removed.


Totally failed and wrong color.

🤔
Partially good. Some shadows are removed.


Totally failed and wrong color.

Document Deblur

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Deblur this document image to enhance text clarity. EN

🤔
Partially good. Text is clear but extraneous content is added.


Clear but unreadable text.

🤔
Partially good. Some text is unreadable.


All content has been erased.


Unreadable text.


Totally failed.

对本文档图像进行去模糊处理 ZH

🤔
Partially good. Some text is unreadable and extraneous content is added.


Clear but unreadable text.


Clear but unreadable text.


Clear but unreadable text.


Unreadable text.


Totally failed.

Appearance Enhancement

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Please help me enhance this document image and output a clear, PDF-like version of the document EN


Enhanced appearance but significant content loss.


Unreadable text.


Enhanced appearance, but the text is unreadable.


Enhanced appearance, but the text is unreadable.

请帮我增强这张文档图像,输出一个类似pdf的清晰文档 ZH

🤔
Partially good. Enhanced appearance, but the table below is not in the input.


Enhanced appearance, but the text is unreadable.


Totally failed.


Totally failed.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Please change the text "Stage 1: Domain-Specific Categorization" into "This is a paper of Qwen2.5-VL" EN

🤔
Modified successfully but some content is missing.


Chaotic and unreadable text.


Totally failed and entirely unseen text.

change "7.30pm" to "11.45 am" EN

🤔
Modified successfully, but some content is missing.

🤔
Modified successfully but some content is wrong.

帮我将图中的“人工智能”改为“深度学习”,“PyTorch”改为“TensorFlow” ZH

🤔
Modified successfully, but some content is missing.

🤔
Modified successfully but some content is unreadable.

将价格改为21.88 ZH

🤔
Modified successfully but some content is missing.


The number is wrong and some content is missing.

📜Historical Document Image

T2I Generation

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
生成一页中国古代书籍,泛黄的旧纸张,竖排的中文毛笔书法,传统木刻印刷风格,精美的边框,纸张边缘磨损,有古旧质感,明清风格,高细节,写实光影,从上往下的视角 ZH


Requirements fulfilled.


Chaotic and unreadable text.

一张古籍书页的特写,纸张泛黄,带有明显的岁月痕迹。页面上书写着毛笔字,内容是《道德经》的第一章:“道可道,非常道;名可名,非常名。无名,天地之始;有名,万物之母。” 字迹工整,但部分笔画略有模糊。页面边缘有虫蛀的痕迹,并有一些墨迹晕染开来。背景是深色的木质书桌,桌面上散落着一些毛笔、砚台和镇纸。光线昏暗,从左上方照射下来,营造出一种古老而神秘的氛围。 ZH

🤔
Most requirements are fulfilled but the content is incomplete and incorrect.


Chaotic and unreadable text.

生成三页连续的《史记·项羽本纪》古籍书页图片。书页采用明代风格,使用仿古宣纸,纸张略微泛黄,带有轻微的墨迹晕染。字体为工整的小楷,页面排版为传统的竖排版式,每页约20行,每行约15字。 书页边缘有轻微的磨损和虫蛀痕迹,但整体保存完好。背景为深色木质书桌,桌面干净整洁,仅有一盏古朴的油灯提供照明。光线柔和,营造出一种宁静而庄重的氛围。 请确保三页书页的风格、字体、纸张材质、墨迹晕染程度等细节保持高度一致,使它们看起来像是同一本书的连续页面。 ZH

🤔
Most requirements fulfilled. But the content is not Chinese and its language is unidentified.


The pages are not consecutive, and the text lacks semantic meaning.

Generate a close-up image of an aged manuscript page written in English. The page is made of thick, parchment-like material, yellowed with age and showing subtle signs of wear and tear. The text is written in a formal, calligraphic script reminiscent of the 16th century, with ornate capital letters and flowing lines.
The text on the page is an excerpt from Shakespeare's Hamlet, Act 1, Scene 2, starting with the line: "O, that this too too solid flesh would melt, Thaw and resolve itself into a dew!" and continuing for several lines.
The page has faint water stains and minor ink smudges, adding to its aged appearance. The edges are slightly frayed and uneven. The background is a dark, out-of-focus surface, perhaps a wooden table or leather-bound book. The lighting is soft and diffused, creating a sense of antiquity and scholarly atmosphere.
Ensure the script is legible but clearly handwritten, not a modern font. The overall impression should be that of a genuine historical document.
EN


Requirements fulfilled.

🤔
A historical document, but the text does not appear to be English.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

将图片中的“所有不可得意界”修改成“今天天气很好” ZH


The modification is incorrect, and other text is also wrong.


Chaotic and unreadable text.


Text is not modified.

Modify "CONGRESS" to "COVERING". EN


Requirements fulfilled, although super-resolution is unintentionally applied.

🤔
Modification is correct. But some content is missing.


Totally failed.

Historical Document Restoration

Input Image Prompt Language GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

修复这张古籍图片中破损和缺失的文字 ZH


Restoration failed. The original content has been changed and the background is incorrect.


Unreadable text.

修复这张古籍图片中破损和缺失的文字,保持文字风格相同以及背景一致 ZH


Restoration totally failed.


Unreadable text and incorrect background.

Style Transfer

Input Image1 Input Image 2 Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

请将第二张古籍图片的风格迁移到第一张古籍上,包括背景颜色、字体样式、笔画粗细等等。 ZH


Style and content are totally incorrect.


Unreadable text and incorrect style.

Super Resolution

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Perform super-resolution on this image. EN


Requirements fulfilled, although some text is cropped.

Some text is unreadable.


Super resolution failed.


Failed.

✏️Handwritten Text Image

T2I Generation

Page Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
A full page of handwritten study notes in neat cursive on lined paper, written in blue ink, containing the following text:

Chapter 4: Classical Mechanics and Newton's Laws Newton's Three Laws of Motion form the foundation of classical mechanics:
First Law (Inertia): An object will remain at rest or in uniform motion unless acted upon by an external force. This principle explains why objects in space continue moving indefinitely.
Second Law (F=ma): The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. This relationship is expressed as F=ma, where:
F represents the net force
m represents the mass
a represents acceleration

Third Law (Action-Reaction): For every action, there is an equal and opposite reaction. Examples include:
Rocket propulsion
Walking mechanics
Recoil in firearms
Key Applications in Real World:
• Automotive design and safety
• Sports biomechanics
• Aerospace engineering
• Structural design
EN


Well done!


Unreadable text.

A handwritten journal entry in flowing handwriting with slight right slant, black ink on cream paper:
September 15, 2024

Today marked my first week in Tokyo, and the city continues to amaze me at every turn. The morning began with a visit to the Tsukiji Outer Market, where the narrow alleys were already buzzing with activity by 7 AM. The aroma of grilled seafood and the calls of vendors created an atmosphere that felt both chaotic and perfectly orchestrated.

I managed to try tamago on a stick - a sweet Japanese omelet that melted in my mouth. The vendor, an elderly man with kind eyes, showed me how they carefully roll the eggs layer by layer. It's these small interactions that make traveling so meaningful.

In the afternoon, I explored the Yanaka district, one of Tokyo's oldest neighborhoods. The area survived the wartime bombings, preserving its traditional architecture and atmosphere. Small temples are tucked between modern homes, and cats roam freely through the quiet streets. I stopped at a local coffee shop where the owner has been roasting beans for over 40 years.
Must remember to visit:

- Sensoji Temple at sunrise
- Shimokitazawa for vintage shopping
- Try the ramen place recommended by Mari
- Book tea ceremony for next week
EN


Well done!


Some text is correct but most is unreadable.

一页学生课堂笔记的照片,使用黑色中性笔书写的整页中文手写文字,字体为快速书写体,略带潦草但可辨识。笔记有标题、段落、要点突出,可能有下划线、圈注、箭头等标记。纸张为横格笔记本纸,顶部有日期与课程标题。文字密集,呈现真实的学习笔记风格。内容为:“【历史笔记】——中国古代政治制度(上)

一、宗法制与分封制
宗法制:以血缘关系维系的政治制度,核心是嫡长子继承制,确保家族权力的延续。
分封制:周天子将土地和人民分封给亲属、功臣建立诸侯国,诸侯需定期朝贡。

二、中央集权制度的确立
秦始皇统一中国后废分封、行郡县。郡县制由皇帝直接委派官员管理地方,形成中央集权雏形。

三、汉代的中外朝制度
汉武帝时设立“中朝”,由皇帝亲信掌权,引发外戚与宦官之争。外朝是传统官僚系统。

四、唐代三省六部制
中书省:起草政令;门下省:审议政令;尚书省:执行政令。六部分工明确:吏、户、礼、兵、刑、工。

五、宋代的文官体系
加强对军权的控制,设“枢密院”管理军政,官员由皇帝直接任命,中央权力进一步上升。

重点:从分封制到郡县制,是中国古代政治制度质的飞跃。”
ZH

🤔
Partially good! But the image is cropped to square.


Unreadable text.

一张泛黄的信纸,上面用钢笔写满了整段中文手写文字,书写风格自然、连贯,略有修改痕迹,字迹工整但略显随性。信纸上文字从左上角起,整齐排列至底部,行距适中。纸张有轻微折痕,整体风格温暖真实。信件内容为:“亲爱的朋友:

你好呀!

写这封信的时候,窗外正飘着细细的春雨。空气里有青草的气息,像极了我们小时候一起在巷子里追逐打闹的日子。那时候无忧无虑,天总是那么蓝,笑声也特别清脆。

最近我在读一些老书,比如《围城》和《人间词话》,越读越觉得,人的一生最重要的不是成就,而是情感的落点。想到你,我就觉得温暖。我们虽然天各一方,但文字总能让彼此靠近。

希望你一切都好,生活顺利,心情舒畅。如果有空,记得回信哦!

此致 敬礼!

你的老朋友

林然

2023年4月”
ZH


Mostly correct, although some text is wrong.


Unreadable text.

生成一段手写的文字图片,内容为“当前,租房人口规模持续扩大,租房人口结构也发生了显著变化。蓝皮书数据显示,四大一线城市中租房人口规模接近4000万人,占比接近50%。在全国40个重点城市的租赁市场中,35岁以上的租客占比达到35%以上,较2021年增长了4.9个百分点,成为所有年龄层租客中占比提升最快的群体。”,要求书写风格独特洒脱。 ZH


Mostly correct, although some text is wrong.


Unreadable text and missing content.

Paragraph Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
请给我生成一张手写文字图片,内容是“ICDAR是文档分析与识别领域的顶级会议。在数字化转型时代,这一领域的重要性日益凸显。该旗舰会议的第19届将于2025年9月16日至21日在中国武汉举行。”,要求书写风格潦草。 ZH


Well done!


Unreadable text.


Almost totally failed.

Line Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Please generate an image with handwritten text that says: "OpenCV is open source, contains over 2500 algorithms, and is operated by the non-profit Open Source Vision Foundation." The handwriting style should be scribbled. EN


Well done!

🤔
Partially correct but extra content is added.

🤔
Partially correct.


Almost totally failed.

Character (Font) Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Please generate a handwritten character "P". EN


Requirements fulfilled.


Requirements fulfilled.
生成一个手写汉字“天”,风格任意 ZH


Requirements fulfilled.


Requirements fulfilled.


Totally failed.

Interleaved Image-Text

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Generate a hand-drawn physics diagram illustrating the law of reflection:
1. A flat horizontal surface representing a mirror.
2. An incident ray approaching the surface at an angle, drawn with an arrow.
3. A reflected ray bouncing off the surface symmetrically, also with an arrow.
4. A normal line drawn perpendicular to the surface at the point of incidence.
5. Clear angle markings: the angle of incidence (labeled as θᵢ) and the angle of reflection (labeled as θᵣ)
6. Degree values annotated next to the angles (e.g., 45°).
7. Dashed lines used as angle guides (from rays to the normal).
8. All elements labeled with clean, handwriting-style text.
9. Overall style: hand-drawn, minimalistic, like a whiteboard or notebook sketch.
10. Background: plain white or paper texture; no photographic elements.
EN


Requirements fulfilled, although the vertical line is shifted away from the center.


Almost totally failed.

🤔
Partially fulfilled. Prompt is too long and truncated to 77 tokens.


Almost totally failed.

Text Editing

Page Level

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Add an embossed word that reads “Sun rises.” in the appropriate place. EN


Text is added, but some text is cut off and the image is cropped into a square format.


Text is added, but some content is missing.

Paragraph Level

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

请将文字“演讲的力量”修改为“讲话的力量”。其他文字保持不变 ZH

🤔
Partially correct. Modified successfully but the image becomes square and some texts are cropped.

🤔
Partially correct. Modified successfully but some texts are wrong.

Line Level

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Change "similarities" to "functionalities". EN

🤔
Partially correct. Modified successfully, but the image is squared and some text is cropped. Clarity unexpectedly improves.

🤔
Partially correct. Modified successfully but most content is wrong.

Handwritten Text Removal

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

请擦除这张图片中所有的手写笔迹 ZH


Totally failed.


Everything is removed.

将"高考加油鸭"这句话擦除 ZH

🤔
Successful removal, but the image is squared. Clarity unexpectedly improves.


All texts are removed.

Remove all handwritten text in this image. EN

🤔
Successful removal. But the image is squared. Drawings unexpectedly change.

🤔
Successful removal but the color and objects are changed.

Erase text "Football, cricket, running" in this image. EN


Text unedited. Light, drawings, and background color change.


Some content has been mistakenly removed, and certain text has become unreadable.

📷Scene Text Image

T2I Generation

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Create a street sign image with text "Tomorrow". EN


Requirements fulfilled.


Requirements fulfilled.


Requirements fulfilled.

生成一个街上商店的招牌,内容是“超级市场”。 ZH


Requirements fulfilled.


Requirements fulfilled.


I don’t know what this is.

A bustling cyberpunk night market in a futuristic Asian metropolis, glowing with neon signs in multiple languages (Chinese, Japanese, Korean, Arabic, and English). The scene is filled with dense fog, reflections on wet pavement, flying cars above, and diverse crowds walking under neon umbrellas. Holographic advertisements float in the air, including a large glowing sign that reads "梦境集市" ("Dream Bazaar") in stylized Chinese calligraphy. Other floating signs display dynamic digital text such as "Now Open!", "未来食品", and "Quantum Noodles". The atmosphere is vibrant, chaotic, and immersive, with dramatic lighting and cinematic composition. Ultra-detailed, 4K, concept art style, with a blend of Blade Runner and Ghost in the Shell aesthetics. Mixed


Most requirements fulfilled. High quality. But some Chinese text is incorrect or lacks semantic meaning.

🤔
Partially fulfilled. But some text is incorrect or lacks semantic meaning.

🤔
Only style is correct. Text rendering failed.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Change "2011" to "3120" and "MAPLES" to "LEAF". EN

🤔
Partially good. Successful removal. However, the image is squared and color is unexpectedly brightened.


Totally failed.


Though the appearance remains the same, the texts are mistakenly edited.

将“易侬”修改为“你好”,“生鲜”修改为“天天” ZH


Scene Text Removal

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Erase the "BEACH" in this image. EN


Requirements fulfilled, although some details are lost.

🤔
Successful removal with the original size maintained. But all text is removed.

🤔
Successful removal but notable traces. The image is unexpectedly squared.

Erase all text in this image. EN


Requirements fulfilled, although the image is squared.

🤔
Successful removal with the original size maintained. But the object is also removed.

🤔
Successful removal. The image is unexpectedly squared.

擦除文字“太古里”。 ZH

去除这张图片里所有的文字。 ZH

🔤Object with Naturally Embedded Text

T2I Generation

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
一张古董木制学生尺子的特写照片。英寸和厘米的刻度应该精确且可读,数字和线条由于年代久远而略有褪色。尺子放在一本打开的教科书上,书页上的文字清晰可辨。图片需要为长方形。 ZH


The generated ruler appears structurally correct but has flawed measurement markings (incorrect spacing/numbering).


The generated ruler appears structurally correct but has unreadable measurement markings and texts.

生成一张高度细节化的老式机械键盘的图像,键帽磨损。键帽上的字符应该清晰可辨,准确地反映QWERTY布局。键盘应显示使用痕迹,有灰尘和轻微变色。背景是一个凌乱的木制桌子。 ZH


The keyboard's overall structure is correctly generated, but exhibits missing keycaps and contains incorrect legends on some remaining keycaps.


The keyboard's overall structure is correctly generated but the texts on the keycaps are unreadable.

Generate a photorealistic smartwatch with a high-resolution display showing authentic embedded UI elements. Feature a sleek metallic casing with subtle branding and precisely labeled buttons. The active screen should display clear time, health metrics and notifications with pixel-perfect readability. Ensure all text appears naturally integrated into the interface without artificial overlays. Include realistic material details like screen reflections and slight wear marks. Render in ultra HD with professional lighting for maximum realism. EN


Most requirements fulfilled. High quality. But the brand SMRTWRCH may be incorrect.


The overall structure is correctly generated but the texts are unreadable.

生成一个饮料瓶,瓶身上印有中文品牌名、营养成分和生产日期,瓶身为透明塑料材质,有反光。 ZH

🤔
Most requirements fulfilled. Some Chinese text lacks semantic meaning.


Result contains two bottles instead of one, and the content on the bottle surfaces is unreadable.

A smartphone back with the brand name 'TechFuture' subtly printed in a stylish font. The phone has a glossy finish and is reflecting light. EN


Requirements fulfilled. High quality.


The overall structure is correctly generated but the text is unreadable.

A bicycle computer showing the speed and distance traveled in a digital font. The display reads '25.5 km/h' and '15.2 km'. EN


Requirements fulfilled. High quality.


The overall structure is correctly generated but the text is wrong.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Adjust the dashboard to show a speed of 60 km/h with the speedometer needle correctly positioned. Also, set the tachometer to a realistic RPM for that speed, like 2000 RPM, ensuring the vehicle's status appears consistent and accurate. Mixed

🤔
Partially good: the speed is correct at 60 km/h, but there are text errors, an incorrect speedometer needle, and additional unintended changes.

🤔
Partially good: the speed is correct at 60 km/h, but there are text errors, an incorrect speedometer needle, and additional unintended changes.

将0改成7,“冷藏”改成“风速”。 ZH

🤔
Partially good. The number is correctly modified, but the Chinese text is not. Other text is not precisely retained.


The text is not modified correctly, a large amount of additional content is changed, and some text is unreadable.

Modify "F5.6" to "OK.8" and "ONE" to "FOUR" EN

🤔
Partially good. Correct modification. But the image is accidentally squared.

🤔
Partially good. The text changes are correct, but a large amount of additional content has been modified.

🌈Artistic Text Image

T2I Generation

Line Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Generate a line of artistic text with intricate details, creative typography, and visual appeal, ensuring that each character has a different color. The font should have a unique aesthetic, incorporating elegant curves, bold strokes, or decorative elements. The text content should be: 'OpenCV is open source, contains over 2500 algorithms, and is operated by the non-profit Open Source Vision Foundation.' EN

🤔
Partially good. Some texts are incorrect.


Most of the text content is missing or incomplete.

生成一行具有复杂细节、创意排版和视觉吸引力的艺术文本,要求每一个文字的颜色都不相同,字体应具有独特的美感,融入优雅的曲线、粗犷的笔触或装饰元素。文本的内容为“生活就像海洋,只有意志坚强的人才能到达彼岸”。 ZH

🤔
Partially good. Some texts are incorrect.


Most of the text content is missing or incomplete.

生成一行具有复杂细节、创意排版和视觉吸引力的艺术文本,要求每一个文字的颜色都不相同,字体应具有独特的美感,融入优雅的曲线、粗犷的笔触或装饰元素。文本的内容为“龒厵䨫巴邑䶕脀勧忄”。 ZH


Totally failed. Unable to handle complex Chinese text.


Totally failed. Unable to handle complex Chinese text.

Character (Font) Level

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
Please generate an artistic font "A". EN


Requirements fulfilled.


Requirements fulfilled.

🤔
Partially correct.


Requirements fulfilled.
请生成一个艺术字,内容为“瀧”。 ZH


Totally failed. Unable to handle complex Chinese text.


Totally failed.


Totally failed.


Totally failed. Unable to handle Chinese text.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Replace the "Thank You" text in the image with "Welcome Home" in the same watercolor style and green color. EN

Change the text "HAPPY NEW YEAR 2025" to "CELEBRATE LIFE 2025", and modify the style to a vintage retro look with warm sepia tones while retaining the fireworks and cityscape background. EN

将图片中的橙色文字替换为“秋日暖阳”,使用相似的笔刷风格。(Replace the orange text in the image with "秋日暖阳", using a similar brush style.) ZH

Style Transfer

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

参照图中的汉字风格,生成“一起去旅行”这句话 (Following the style of the Chinese characters in the image, generate the sentence "一起去旅行".) ZH


Requirements fulfilled.

🤔
Some of the text is wrong.

参照图中的汉字风格,生成“一起去旅行”这句话 (Following the style of the Chinese characters in the image, generate the sentence "一起去旅行".) ZH


Requirements fulfilled.


Totally failed.

Refer to the text style of this image, create an image with text “You are welcome” EN


Requirements fulfilled.

🤔
Additional text is generated.

🌌Slide Image

T2I Generation

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
A highly detailed and visually rich PowerPoint slide in a modern and professional style, featuring a bold English title at the top, multiple content blocks with varied font sizes including bullet points, short paragraphs, and highlighted keywords. The slide includes colorful icons, infographic-style illustrations, and a blend of clean vector graphics with hand-drawn sketch elements. A vertical sidebar shows a step-by-step process or timeline, and a small pie chart or data visualization is placed in one corner, labeled in English. The background is subtle, with a soft gradient or abstract texture that enhances readability without distraction. The overall layout is well-balanced, with clear structure, effective use of whitespace, and a harmonious color palette. The slide should appear as a fully finished presentation page with meaningful English content, refined typography, and polished visual composition. EN


Most requirements fulfilled.

🤔
Partially fulfilled. Some texts are blurred.

🤔
Partially correct, but text rendering failed completely.
Generate a visually stunning and informative PowerPoint slide. The slide should be meticulously designed with a sophisticated layout, incorporating a diverse range of elements.
Text: Include well-written, concise English text in a professional font (e.g., Arial, Calibri, Times New Roman). The text should be logically organized and easy to read, with a clear title and supporting bullet points or short paragraphs.
Illustrations: Integrate intricate patterns, detailed drawings, and artistic paintings. These visual elements should be relevant to the text and enhance the overall message of the slide. Consider using a consistent color palette to create a harmonious aesthetic.
Layout: The slide should have a balanced and visually appealing layout. Experiment with different arrangements of text and images to create a dynamic and engaging design. Use whitespace effectively to avoid clutter.
Details: Pay attention to fine details such as shadows, gradients, and textures to add depth and realism to the image. The overall impression should be one of high quality and professionalism.
EN


Most requirements in the prompt are fulfilled.


Not a slide.

🤔
Objects are partially correct, but text rendering failed completely.
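一张视觉精美、信息丰富的长方形PPT幻灯片,主题为“未来科技与智能城市”。风格现代、科技感十足,整体排版清晰、专业,结构完整。幻灯片顶部是用中文写成的大标题“未来科技的城市图景”,使用无衬线字体,醒目现代。页面中部包含多个内容区域,展示有关智能交通系统、自动驾驶、物联网(IoT)、5G 网络基础设施等信息,每个部分配有简洁的中文段落说明和要点列表,如“智慧交通”、“数据中心”、“无人配送系统”等关键词以加粗或高亮方式呈现。页面中配有简洁清晰的图标、线条风格的插图、未来城市的建筑草图、以及科技设备的概念图。右下角是一个中文标注的数据图表(如柱状图或环形图)。背景为深蓝或渐变色调,带有抽象科技纹理。整体配色高对比,布局平衡有序,图文并茂,幻灯片应为完整内容,不能有留白或模板感。(A visually refined, information-rich rectangular PPT slide on the theme "未来科技与智能城市". The style is modern and high-tech, with clear, professional typesetting and a complete structure. At the top of the slide is a large Chinese title, "未来科技的城市图景", in an eye-catching, modern sans-serif font. The middle of the page contains multiple content areas presenting information on intelligent transportation systems, autonomous driving, the Internet of Things (IoT), 5G network infrastructure, and so on. Each section has a concise Chinese paragraph and a bullet list, with keywords such as "智慧交通", "数据中心", and "无人配送系统" shown in bold or highlighted. The page includes clean icons, line-style illustrations, architectural sketches of a future city, and concept drawings of tech devices. In the lower-right corner is a data chart labeled in Chinese, such as a bar chart or donut chart. The background is dark blue or a gradient, with an abstract tech texture. The overall color scheme is high-contrast and the layout balanced and orderly, combining text and graphics. The slide should contain complete content, with no blank areas and no template feel.) ZH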

🤔
Partially correct. Large text is good but smaller text is chaotic.

🤔
Partially correct. Smaller text is chaotic.

🤔
Partially correct, but text rendering failed completely.

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Change "Document" to "Overleaf" and "Visual Question" to "Textual-based" EN

将“目标”修改为“关键”,“不需要标准答案”修改为“不用”(Change "目标" to "关键", "不需要标准答案" to "不用") ZH

🖼️Poster Image

T2I Generation

Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment
A stunning and informative Ocean Conservation poster featuring vibrant illustrations of marine life such as dolphins, turtles, colorful fish, and coral reefs, along with clean beaches and deep blue waves, with prominent, uplifting text like "Protect Our Oceans", "Save Marine Life", "Keep the Sea Plastic-Free", "Every Action Matters", and "Blue Planet, Bright Future", all arranged in a harmonious and visually rich design that inspires care and responsibility for the ocean. EN

A vibrant and lively Summer Music Festival poster filled with colorful illustrations of musical instruments, a cheering crowd, a bright stage with lights, palm trees, and summer decorations, featuring bold, eye-catching text such as "Summer Music Festival", "Live Bands", "Dance All Night", "July 15th", and "Join the Party", with an energetic layout and plenty of dynamic visual elements to create a festive and exciting atmosphere. EN

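设计一张中国传统茶文化海报,包含文字“品茗雅集”、“传承千年茶道”、“每日新茶”、“龙井·碧螺春·铁观音”、“静心品茶 修身养性”、“营业时间:上午9点-晚上9点”。采用水墨画风格,淡雅的绿色和金色搭配,配以茶叶、茶具、竹子等元素,使用书法字体,整体布局古典优雅。(Design a traditional Chinese tea culture poster containing the text "品茗雅集", "传承千年茶道", "每日新茶", "龙井·碧螺春·铁观音", "静心品茶 修身养性", and "营业时间:上午9点-晚上9点". Use an ink-wash painting style with an understated green and gold palette, accompanied by elements such as tea leaves, tea sets, and bamboo. Use a calligraphic font, with a classical and elegant overall layout.) ZH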

Text Editing

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Change "LOREM" to "spider", "ACID" to "RepT" EN

将“愚人节狂欢派对”改为“万圣节庆祝活动”(Change "愚人节狂欢派对" to "万圣节庆祝活动") ZH

🕌Layout-aware Text Generation

Input Image Prompt Lang. GPT-4o Qwen-VLo Flux.1-Kontext-dev OmniGen2 BAGEL Janus-4o
Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment Output Image Assessment

Add text “Good coffee” in appropriate position with layout awareness. EN


Text is correct, but the coffee’s position has changed. Objects are not preserved.


Requirements fulfilled.


Text is not correct. Image is squared.

Add text “Camera is good” in appropriate position with layout awareness. EN


Requirements fulfilled, despite a slight change to the text on the camera.


Text is incorrect.


Text is not correct. Image is squared.

根据图片布局,在适当的位置添加文字“吃水果有益身体健康”。(Based on the image layout, add the text "吃水果有益身体健康" in an appropriate position.) ZH

根据图片布局,在适当的位置添加文字“这个蛋糕很好吃”。(Based on the image layout, add the text "这个蛋糕很好吃" in an appropriate position.) ZH

📋Citation

If you find our work helpful, please cite our paper:

@article{zhang2025aesthetics,
  title={{Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR}},
  author={Zhang, Peirong and Xu, Haowei and Zhang, Jiaxin and Xu, Guitao and Zheng, Xuhan and Yang, Zhenhua and Liu, Junle and Zhang, Yuyi and Jin, Lianwen},
  journal={arXiv preprint arXiv:2507.15085},
  year={2025}
}

📧Contact

[email protected]

🌊Acknowledgement

Copyright 2025, Deep Learning and Vision Computing (DLVC) Lab, South China University of Technology.

⭐Star History

Star Rising
